Comparing Naïve Bayesian and k-NN algorithms for automatic email classification

Louis Eisenberg
Stanford University M.S. student
PO Box 18199
Stanford, CA 94309
650-269-9444
louis@stanfordalumni.org

ABSTRACT

The problem of automatic email classification has numerous possible solutions; a wide variety of natural language processing algorithms are potentially appropriate for this text classification task. Naïve Bayes implementations are popular because they are relatively easy to understand and implement, they offer reasonable computational efficiency, and they can achieve decent accuracy even with a small amount of training data. This paper seeks to compare the performance of an existing Naïve Bayesian system, POPFile [1], to a hand-tuned k-nearest neighbors system. Previous research has generally shown that k-NN should outperform Naïve Bayes in text classification. My results fail to support that trend, as POPFile significantly outperforms the k-NN system. The likely explanation is that POPFile is a system specifically tuned to the email classification task that has been refined by numerous people over a period of years, whereas my k-NN system is a crude attempt at the problem that fails to exploit the full potential of the general k-NN algorithm.

INTRODUCTION

Using machine learning to classify email messages is an increasingly relevant problem as the rate at which Internet users receive emails continues to grow. Though classification of desired messages by content is still quite rare, many users are the beneficiaries of machine learning algorithms that attempt to distinguish spam from non-spam (e.g. SpamAssassin [2]). In contrast to the relative simplicity of spam filtering – a binary decision – filing messages into many folders can be fairly challenging. The most prominent non-commercial email classifier, POPFile, is an open-source project that wraps a user-friendly interface around the training and classification of a Naïve Bayesian system. My personal experience with POPFile suggests that it can achieve respectable results but it leaves considerable room for improvement. In light of the conventional wisdom in NLP research that k-NN classifiers (and many other types of algorithms) should be able to outperform a Naïve Bayes system in text classification, I adapted TiMBL [3], a freely available k-NN package, to the email filing problem and sought to surpass the accuracy obtained by POPFile.

DATA

I created the experimental dataset from my own inbox, considering the more than 2000 non-spam messages that I received in the first quarter of 2004 as candidates. Within that group, I selected approximately 1600 messages that I felt confident classifying into one of the twelve “buckets” that I arbitrarily enumerated (see Table 1). I then split each bucket and allocated half of the messages to the training set and half to the test set.
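
For concreteness, a minimal sketch of that per-bucket 50/50 split might look like the following (Java is used because the supporting tools described later are Java programs; the class and variable names are illustrative, not the author's actual code):

    import java.util.*;

    // Hypothetical sketch of the per-bucket 50/50 train/test split described above.
    class TrainTestSplit {
        static void split(Map<String, List<String>> messagesByBucket,
                          List<String> train, List<String> test) {
            for (List<String> bucket : messagesByBucket.values()) {
                int half = bucket.size() / 2;                      // half of each bucket goes to training
                train.addAll(bucket.subList(0, half));             // first half: training set
                test.addAll(bucket.subList(half, bucket.size()));  // remainder: test set
            }
        }
    }
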
As input to POPFile, I kept the messages in Eudora mailbox format. For TiMBL, I had to convert each message to a feature vector, as described in the k-NN section below.

Code   Size*   Description
ae       86    academic events, talks, seminars, etc.
bslf     63    buy, sell, lost, found
c       145    courses, course announcements, etc.
hf       43    humorous forwards
na       37    newsletters, articles
p       415    personal
pa       53    politics, advocacy
se      134    social events, parties
s       426    sports, intramurals, team-related
ua       13    University administrative
w       164    websites, accounts, e-commerce, support
wb       36    work, business

* training and test combined

Table 1. Classification buckets

POPFILE

POPFile implements a Naïve Bayesian algorithm. Naïve Bayesian classification depends on two crucial assumptions (both of which are results of the single Naïve Bayes assumption of conditional independence among features, as described in Manning and Schutze [4]): 1. each document can be represented as a bag of words, i.e. the order and syntax of words is completely ignored; 2. in a given document, the presence or absence of a given word is independent of the presence or absence of any other word. Naïve Bayes is thus incapable of appropriately capturing any conditional dependencies between words, guaranteeing a certain level of imprecision; however, in many cases this flaw is relatively minor and does not prevent the classifier from performing well.
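
POPFile's internals are more elaborate, but the following sketch illustrates the bag-of-words decision rule implied by these assumptions: a log prior per category plus a sum of per-word log likelihoods, with add-one smoothing. The class and method names are hypothetical; this is not POPFile's code.

    import java.util.*;

    // Minimal multinomial Naive Bayes over a bag of words, with add-one smoothing.
    // Illustrative only; POPFile's actual implementation differs in its details.
    class NaiveBayes {
        Map<String, Integer> docCount = new HashMap<>();                // training docs per category
        Map<String, Map<String, Integer>> wordCount = new HashMap<>();  // word counts per category
        Map<String, Integer> totalWords = new HashMap<>();              // total tokens per category
        Set<String> vocabulary = new HashSet<>();
        int totalDocs = 0;

        void train(String category, List<String> words) {
            totalDocs++;
            docCount.merge(category, 1, Integer::sum);
            Map<String, Integer> counts = wordCount.computeIfAbsent(category, c -> new HashMap<>());
            for (String w : words) {
                counts.merge(w, 1, Integer::sum);
                totalWords.merge(category, 1, Integer::sum);
                vocabulary.add(w);
            }
        }

        String classify(List<String> words) {
            String best = null;
            double bestScore = Double.NEGATIVE_INFINITY;
            for (String c : docCount.keySet()) {
                // log prior + sum of log likelihoods; word order is ignored (bag of words)
                double score = Math.log(docCount.get(c) / (double) totalDocs);
                Map<String, Integer> counts = wordCount.get(c);
                int total = totalWords.getOrDefault(c, 0);
                for (String w : words) {
                    int count = counts.getOrDefault(w, 0);
                    score += Math.log((count + 1.0) / (total + vocabulary.size()));
                }
                if (score > bestScore) { bestScore = score; best = c; }
            }
            return best;
        }
    }
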
                                                                   represent a text document as a vector of features.
To train and test POPFile, I installed the software on a Windows system and then used a combination of Java and Perl to perform the necessary operations. To train the classifier I fed the mbx files (separated by category) directly to the provided utility script insert.pl. For testing, I split each test-set mbx file into its individual messages, then used a simple Perl script to feed the messages one at a time to the provided script pipe.pl, which reads in a message and outputs the same message with POPFile’s classification decision prepended to the Subject header and/or added in a new header called X-Test-Classification. After classifying all of the messages, I ran another Java program, popfilescore, to tabulate the results and generate a confusion matrix.
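
popfilescore itself is not reproduced here; as a rough, hypothetical illustration of that kind of tabulation, a confusion matrix and overall accuracy can be computed along these lines (bucket codes from Table 1; all names illustrative):

    import java.util.*;

    // Sketch of tabulating (true bucket, predicted bucket) pairs into a confusion matrix.
    // Illustrative only; not the author's popfilescore program.
    class ConfusionMatrix {
        public static void main(String[] args) {
            String[] labels = {"ae", "bslf", "c", "hf", "na", "p", "pa", "se", "s", "ua", "w", "wb"};
            Map<String, Integer> index = new HashMap<>();
            for (int i = 0; i < labels.length; i++) index.put(labels[i], i);

            int[][] matrix = new int[labels.length][labels.length];
            // e.g. true buckets from the folder names, predictions parsed from X-Test-Classification
            String[][] results = { {"p", "p"}, {"s", "p"}, {"ae", "se"} };

            int correct = 0;
            for (String[] r : results) {
                matrix[index.get(r[0])][index.get(r[1])]++;   // row = true bucket, column = prediction
                if (r[0].equals(r[1])) correct++;
            }
            System.out.printf("overall accuracy: %.1f%%%n", 100.0 * correct / results.length);
        }
    }
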
k-NN

To implement my k-NN system I used the Tilburg Memory-Based Learner, a.k.a. TiMBL. I installed and ran the software on various Unix-based systems. TiMBL is an optimized version of the basic k-NN algorithm, which attempts to classify new instances by seeking “votes” from the k existing instances that are closest/most similar to the new instance. The TiMBL reference guide [5] explains:

   Memory-Based Learning (MBL) is based on the idea that intelligent behavior can be obtained by analogical reasoning, rather than by the application of abstract mental rules as in rule induction and rule-based processing. In particular, MBL is founded in the hypothesis that the extrapolation of behavior from stored representations of earlier experience to new situations, based on the similarity of the old and the new situation, is of key importance.

Preparing the messages to serve as input to the k-NN algorithm was considerably more difficult than in the Naïve Bayes case. A major challenge in using this algorithm is deciding how to represent a text document as a vector of features. I chose to consider five separate sections of each email: the attachments; the from, to, and subject headers; and the body. For attachments, each feature was a different file type, e.g. jpg or doc. For the other four sections, each feature was an email address, hyperlink URL, or stemmed and lowercased word or number. I discarded all other headers. I also ignored any words of length less than 3 letters or greater than 20 letters and any words that appeared on POPFile’s brief stopwords list. Altogether, this resulted in each document in the data set being represented as a vector of 15,981 features.
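
A minimal sketch of that token filtering (lowercasing, the 3–20 character length bounds, and a stopword check) follows; stemming is omitted and the stopword list is a small stand-in for POPFile's actual list:

    import java.util.*;

    // Sketch of the word-level feature filtering described above. Illustrative only.
    class TokenFilter {
        static final Set<String> STOPWORDS = new HashSet<>(Arrays.asList("the", "and", "for", "you"));

        static List<String> filter(String text) {
            List<String> features = new ArrayList<>();
            for (String token : text.toLowerCase().split("\\W+")) {
                if (token.length() < 3 || token.length() > 20) continue;  // drop very short/long tokens
                if (STOPWORDS.contains(token)) continue;                  // drop stopwords
                features.add(token);
            }
            return features;
        }
    }
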
For attachments, subject, and body, I used tf.idf weighting according to the equation:

   weight(i,j) = (1 + log(tf(i,j))) · log(N / df(i)), if tf(i,j) ≥ 1 (and 0 otherwise),

where i is the term index, j is the document index, tf(i,j) is the count of term i in document j, N is the number of documents, and df(i) is the number of documents containing term i. For the to and from fields, each feature was a binary value indicating the presence or absence of a word or email address. The Java program mbx2featurevectors parses the training or test set and generates a file containing all of the feature vectors, represented in TiMBL’s Sparse format.
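
The weighting function itself is small; a direct, hypothetical transcription of the equation above is:

    // Sketch of the tf.idf weighting defined above:
    // weight(i,j) = (1 + log tf(i,j)) * log(N / df(i)) when term i occurs in document j, else 0.
    class TfIdf {
        static double weight(int tf, int df, int numDocs) {
            if (tf < 1) return 0.0;  // term absent from this document
            return (1.0 + Math.log(tf)) * Math.log((double) numDocs / df);
        }
    }
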
TiMBL processes the training and test data in response to a single command. It has a number of command-line options with which I experimented in an attempt to extract better accuracy. Among them:

   •   k, the number of neighbors to consider when classifying a test point: the literature suggests that anywhere between one and a handful of neighbors may be optimal for this type of task
   •   w, the feature weighting scheme: the classifier attempts to learn which features have more relative importance in determining the classification of an instance; this can be absent (all features get equal weight) or based on information gain or other slight variations such as gain ratio and shared variance
   •   m, the distance metric: how to calculate the nearness of two points based on their features; options that I tried included overlap (basic equals or not equals for each feature), modified value difference metric (MVDM), and Jeffrey divergence
   •   d, the class vote weighting scheme for neighbors; this can be simple majority (all have equal weight) or various alternatives, such as Inverse Linear and Inverse Distance, that assign higher weight to those neighbors that are closer to the instance

For distance metrics, MVDM and Jeffrey divergence are similar and, on this task with its numeric feature vectors, both clearly preferable to basic overlap, which draws no distinction between two values that are almost but not quite equivalent and two values that are very far apart. The other options have no clearly superior setting a priori, so I relied on the advice of the TiMBL reference guide and the results of my various trial runs.
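
TiMBL adds feature weighting, the MVDM and Jeffrey divergence metrics, and heavily optimized search, but the core k-NN procedure with inverse-distance vote weighting can be sketched as follows (the L1 distance here is only a stand-in for TiMBL's metrics; all names are illustrative):

    import java.util.*;

    // Bare-bones k-NN with inverse-distance vote weighting. Illustrative only.
    class KnnClassifier {
        record Instance(double[] features, String label) {}

        static String classify(List<Instance> training, double[] query, int k) {
            // sort training instances by distance to the query
            List<Instance> nearest = new ArrayList<>(training);
            nearest.sort(Comparator.comparingDouble(inst -> distance(inst.features(), query)));

            // each of the k nearest neighbors votes with weight 1 / (distance + epsilon)
            Map<String, Double> votes = new HashMap<>();
            for (Instance inst : nearest.subList(0, Math.min(k, nearest.size()))) {
                double w = 1.0 / (distance(inst.features(), query) + 1e-9);
                votes.merge(inst.label(), w, Double::sum);
            }
            return Collections.max(votes.entrySet(), Map.Entry.comparingByValue()).getKey();
        }

        // plain L1 distance as a simple stand-in metric
        static double distance(double[] a, double[] b) {
            double d = 0;
            for (int i = 0; i < a.length; i++) d += Math.abs(a[i] - b[i]);
            return d;
        }
    }
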
RESULTS/CONCLUSIONS

The confusion matrices for the most successful TiMBL run and for POPFile are reproduced in Tables 2 and 3, respectively. Figure 1 compares the accuracy scores of the two algorithms on each category. Table 4 lists accuracy scores for various combinations of TiMBL options. The number of TiMBL runs possible was limited considerably by the length of time that each run takes – up to several hours even on a fast machine, depending greatly on the exact options specified.

       ae   bs    c   hf   na    p   pa   se    s   ua    w   wb
ae      3    0    0    0    0    1    0   25   14    0    0    0
bs      0    5    0    0    0    3    0    4   19    0    0    0
c       0    1   38    0    0   12    0    8   13    0    0    0
hf      0    1    0    5    0   10    0    0    5    0    0    0
na      1    1    0    0    5   11    0    0    0    0    0    0
p       0    0    0    2    0  189    0    0   15    0    1    0
pa      0    0    0    0    0    2   13    6    5    0    0    0
se      0    2    0    1    0    8    0   27   29    0    0    0
s       0    1    0    0    0   28    0    6  178    0    0    0
ua      0    0    0    0    0    1    0    0    0    5    0    0
w       2    0    0    0    0   41    0    0   12    0   27    0
wb      0    0    0    0    0   18    0    0    0    0    0    0

Table 2. Confusion matrix for best TiMBL run

       ae   bs    c   hf   na    p   pa   se    s   ua    w   wb
ae     38    0    1    0    0    0    0    0    2    0    2    0
bs      0   10    0    0    0    0    0    0   21    0    0    0
c       8    3   51    0    0    4    1    0    2    1    0    0
hf      0    0    0    7    0    7    1    1    4    0    0    0
na      0    0    0    1   32    0    0    0    0    0    0    0
p       0   10    3    8    0  140    2    7   20    0    4    4
pa      3    1    0    0    0    0   18    0    2    0    1    0
se      0    5    2    1    0    3    0   33   20    0    0    0
s       0   14    3    2    0   15    0    2  173    0    0    3
ua      0    0    0    0    0    0    0    0    0    6    0    0
w       1    0    7    0    0    4    1    2    4    2   59    0
wb      0    0    0    1    0    2    0    0    0    0    0   14

Table 3. Confusion matrix for POPFile

As the tables and figure indicate, POPFile clearly outperformed even the best run by TiMBL. POPFile’s overall accuracy was 72.7%, compared to only 61.1% for the best TiMBL trial. In addition, POPFile’s accuracy was well over 60% in almost all of the categories; by contrast, the k-NN system only performed well in three categories. Interestingly, it performed best in the two largest categories, personal and sports – in fact, in those categories it was more accurate than POPFile. Apparently it succeeded in distinguishing those categories from the rest of the buckets and from each other, but failed to pick up on most of the other important differences across buckets.

[Figure 1. Accuracy by category: horizontal bar chart comparing the accuracy (0% to 100%) of TiMBL and POPFile on each of the twelve buckets.]

m         w            k   d            accuracy
MVDM      gain ratio   9   inv. dist.   51.0%
overlap   none         1   majority     54.9%
overlap   inf. gain   15   inv. dist.   53.7%
MVDM      shared var   3   inv. linear  61.1%
Jeffrey   shared var   5   inv. linear  60.2%
overlap   shared var   9   inv. linear  58.9%
MVDM      gain ratio  21   inv. dist.   49.4%
MVDM      inf. gain    7   inv. linear  57.4%
MVDM      shared var   1   inv. dist.   61.0%
MVDM      shared var   5   majority     54.6%

Table 4. Sample of TiMBL trials

The various TiMBL runs provide evidence for a few minor insights about how to get the most out of the k-NN algorithm. The overwhelming conclusion is that shared variance is far superior to the other weighting schemes for this task. Based on the explanation given in the TiMBL documentation, this performance disparity is likely a reflection of the ability of shared variance (and chi-squared, which is very similar) to avoid a bias toward features with more values – a significant problem with gain ratio. The results also suggest that k should be a small number – the highest values of k gave the worst results. The effect of the m and d options is unclear, though simple majority voting seems to perform worse than inverse distance and inverse linear.

It is also important to recognize the impact of the original construction of the feature vectors. Perhaps the k-NN system’s poor performance was a result of unwise choices in mbx2featurevectors: focusing on the wrong headers, not parsing symbols and numbers as elegantly as possible, not trying a bigram or trigram model on the message body, choosing a poor tf.idf formula, etc.

OTHER RESEARCH

A vast amount of research already exists on this and similar topics. Some people, e.g. Rennie et al. [6], have investigated ways to overcome the faulty Naïve Bayesian assumption of conditional independence. Kiritchenko and Matwin [7] found that support vector machines are superior to Naïve Bayesian systems when much of the training data is unlabeled. Other researchers have attempted to use semantic information to improve accuracy [8]. In addition to the two models discussed in this paper, there exist many other options for text classification: support vector machines, maximum entropy and logistic models, decision trees, and neural networks, for example.

REFERENCES

[1] POPFile: http://popfile.sourceforge.net
[2] SpamAssassin: http://www.spamassassin.org
[3] TiMBL: http://ilk.kub.nl/software.html#timbl
[4] Manning, Christopher and Hinrich Schutze. Foundations of Statistical Natural Language Processing. 2000.
[5] TiMBL reference guide: http://ilk.uvt.nl/downloads/pub/papers/ilk0310.pdf
[6] Jason D. M. Rennie, Lawrence Shih, Jaime Teevan and David R. Karger. Tackling the Poor Assumptions of Naive Bayes Text Classifiers. Proceedings of the Twentieth International Conference on Machine Learning. 2003.
[7] Svetlana Kiritchenko and Stan Matwin. Email classification with co-training. Proceedings of the 2001 Conference of the Centre for Advanced Studies on Collaborative Research. 2001.
[8] Nicolas Turenne. Learning Semantic Classes for Improving Email Classification. Biométrie et Intelligence. 2003.
