Comparing Naïve Bayesian and k-NN algorithms for automatic email classification
Louis Eisenberg
Stanford University M.S. student
PO Box 18199
Stanford, CA 94309
650-269-9444
louis@stanfordalumni.org
ABSTRACT

The problem of automatic email classification has numerous possible solutions; a wide variety of natural language processing algorithms are potentially appropriate for this text classification task. Naïve Bayes implementations are popular because they are relatively easy to understand and implement, they offer reasonable computational efficiency, and they can achieve decent accuracy even with a small amount of training data. This paper seeks to compare the performance of an existing Naïve Bayesian system, POPFile [1], to a hand-tuned k-nearest neighbors system. Previous research has generally shown that k-NN should outperform Naïve Bayes in text classification. My results fail to support that trend, as POPFile significantly outperforms the k-NN system. The likely explanation is that POPFile is a system specifically tuned to the email classification task that has been refined by numerous people over a period of years, whereas my k-NN system is a crude attempt at the problem that fails to exploit the full potential of the general k-NN algorithm.

INTRODUCTION

Using machine learning to classify email messages is an increasingly relevant problem as the rate at which Internet users receive emails continues to grow. Though classification of desired messages by content is still quite rare, many users are the beneficiaries of machine learning algorithms that attempt to distinguish spam from non-spam (e.g. SpamAssassin [2]). In contrast to the relative simplicity of spam filtering – a binary decision – filing messages into many folders can be fairly challenging. The most prominent non-commercial email classifier, POPFile, is an open-source project that wraps a user-friendly interface around the training and classification of a Naïve Bayesian system. My personal experience with POPFile suggests that it can achieve respectable results but leaves considerable room for improvement. In light of the conventional wisdom in NLP research that k-NN classifiers (and many other types of algorithms) should be able to outperform a Naïve Bayes system in text classification, I adapted TiMBL [3], a freely available k-NN package, to the email filing problem and sought to surpass the accuracy obtained by POPFile.

DATA

I created the experimental dataset from my own inbox, considering the more than 2000 non-spam messages that I received in the first quarter of 2004 as candidates. Within that group, I selected approximately 1600 messages that I felt confident classifying into one of the twelve "buckets" that I arbitrarily enumerated (see Table 1). I then split each bucket and allocated half of the messages to the training set and half to the test set.
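A minimal Java sketch of that per-bucket split follows; the data structures and the fixed shuffle seed are hypothetical details for illustration, not necessarily the exact procedure used.

    import java.util.*;

    // Split each bucket's messages into training and test halves
    // (illustrative sketch only; details of the original split may differ).
    static void splitBuckets(Map<String, List<String>> buckets,
                             Map<String, List<String>> train,
                             Map<String, List<String>> test) {
        for (Map.Entry<String, List<String>> e : buckets.entrySet()) {
            List<String> msgs = new ArrayList<>(e.getValue());
            Collections.shuffle(msgs, new Random(42)); // fixed seed for repeatability
            int half = msgs.size() / 2;
            train.put(e.getKey(), new ArrayList<>(msgs.subList(0, half)));
            test.put(e.getKey(), new ArrayList<>(msgs.subList(half, msgs.size())));
        }
    }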
As input to POPFile, I kept the messages in Eudora mailbox (mbx) format. For TiMBL, I had to convert each message to a feature vector, as described in the k-NN section below.
  Code  Size*  Description
  ae      86   academic events, talks, seminars, etc.
  bslf    63   buy, sell, lost, found
  c      145   courses, course announcements, etc.
  hf      43   humorous forwards
  na      37   newsletters, articles
  p      415   personal
  pa      53   politics, advocacy
  se     134   social events, parties
  s      426   sports, intramurals, team-related
  ua      13   University administrative
  w      164   websites, accounts, e-commerce, support
  wb      36   work, business

  * training and test sets combined

Table 1. Classification buckets
POPFILE

POPFile implements a Naïve Bayesian algorithm. Naïve Bayesian classification depends on two crucial assumptions (both of which follow from the single Naïve Bayes assumption of conditional independence among features, as described in Manning and Schütze [4]): (1) each document can be represented as a bag of words, i.e. the order and syntax of the words are completely ignored; (2) in a given document, the presence or absence of a given word is independent of the presence or absence of any other word. Naïve Bayes is thus incapable of appropriately capturing any conditional dependencies between words, guaranteeing a certain level of imprecision; however, in many cases this flaw is relatively minor and does not prevent the classifier from performing well.
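To make the bag-of-words assumption concrete, here is a minimal Java sketch of the multinomial Naïve Bayes decision rule (an illustration of the general algorithm, not POPFile's code): each bucket is scored by its log prior plus a sum of per-word log probabilities, with word order ignored entirely.

    import java.util.*;

    // Minimal multinomial Naive Bayes scorer (illustrative sketch only,
    // not POPFile's implementation). A message is treated as a multiset
    // of tokens; each token contributes an independent log-probability.
    class NaiveBayesSketch {
        Map<String, Map<String, Integer>> wordCounts; // bucket -> word -> count
        Map<String, Integer> totalCounts;             // bucket -> total word count
        Map<String, Double> priors;                   // bucket -> P(bucket)
        int vocabularySize;

        String classify(List<String> tokens) {
            String best = null;
            double bestScore = Double.NEGATIVE_INFINITY;
            for (String bucket : priors.keySet()) {
                double score = Math.log(priors.get(bucket));
                for (String w : tokens) {
                    int count = wordCounts.get(bucket).getOrDefault(w, 0);
                    // Add-one smoothing so unseen words do not zero out the score.
                    score += Math.log((count + 1.0)
                            / (totalCounts.get(bucket) + vocabularySize));
                }
                if (score > bestScore) { bestScore = score; best = bucket; }
            }
            return best;
        }
    }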
To train and test POPFile, I installed the software on a Windows system and then used a combination of Java and Perl to perform the necessary operations. To train the classifier I fed the mbx files (separated by category) directly to the provided utility script insert.pl. For testing, I split each test set mbx file into its individual messages, then used a simple Perl script to feed the messages one at a time to the provided script pipe.pl, which reads in a message and outputs the same message with POPFile's classification decision prepended to the Subject header and/or added in a new header called X-Test-Classification. After classifying all of the messages, I ran another Java program, popfilescore, to tabulate the results and generate a confusion matrix.
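That tabulation step is easy to re-create. The sketch below is in the spirit of popfilescore but is not the actual program: it assumes the true bucket is known from the mbx file each message came from, and the predicted bucket is read from the X-Test-Classification header.

    import java.util.*;

    // Sketch of a popfilescore-style tally (illustrative, not the actual
    // program). trueLabels and predictedLabels are parallel lists: the true
    // bucket comes from the source mbx file, the predicted bucket from the
    // X-Test-Classification header that pipe.pl added.
    class ConfusionTally {
        static Map<String, Map<String, Integer>> tally(
                List<String> trueLabels, List<String> predictedLabels) {
            Map<String, Map<String, Integer>> matrix = new TreeMap<>();
            for (int i = 0; i < trueLabels.size(); i++) {
                matrix.computeIfAbsent(trueLabels.get(i), k -> new TreeMap<>())
                      .merge(predictedLabels.get(i), 1, Integer::sum);
            }
            return matrix;
        }
    }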
k-NN

To implement my k-NN system I used the Tilburg Memory-Based Learner, a.k.a. TiMBL. I installed and ran the software on various Unix-based systems. TiMBL is an optimized version of the basic k-NN algorithm, which attempts to classify new instances by seeking "votes" from the k existing instances that are closest/most similar to the new instance. The TiMBL reference guide [5] explains:

  Memory-Based Learning (MBL) is based on the idea that intelligent behavior can be obtained by analogical reasoning, rather than by the application of abstract mental rules as in rule induction and rule-based processing. In particular, MBL is founded in the hypothesis that the extrapolation of behavior from stored representations of earlier experience to new situations, based on the similarity of the old and the new situation, is of key importance.

Preparing the messages to serve as input to the k-NN algorithm was considerably more difficult than in the Naïve Bayes case. A major challenge in using this algorithm is deciding how to represent a text document as a vector of features. I chose to consider five separate sections of each email: the attachments; the From, To, and Subject headers; and the body.
For attachments, each feature was a different file type, e.g. jpg or doc. For the other four sections, each feature was an email address, hyperlink URL, or stemmed and lowercased word or number. I discarded all other headers. I also ignored any words shorter than 3 letters or longer than 20 letters, as well as any words that appeared on POPFile's brief stopwords list. Altogether, this resulted in each document in the data set being represented as a vector of 15,981 features. For attachments, subject, and body, I used tf.idf weighting according to the equation

    weight(i,j) = (1 + log tf_{i,j}) · log(N / df_i)   if tf_{i,j} ≥ 1 (and 0 otherwise),

where i is the term index, j is the document index, tf_{i,j} is the frequency of term i in document j, df_i is the number of documents containing term i, and N is the total number of documents. For the To and From fields, each feature was a binary value indicating the presence or absence of a word or email address.
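Read as code, the weighting function is compact; the following Java helper is an illustrative sketch of the equation above, not an excerpt from mbx2featurevectors.

    // tf.idf weight as defined above (illustrative sketch). tf is the count
    // of term i in document j, df is the number of documents containing
    // term i, and n is the total number of documents.
    static double tfIdf(int tf, int df, int n) {
        if (tf < 1) return 0.0; // absent term gets weight 0
        return (1.0 + Math.log(tf)) * Math.log((double) n / df);
    }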
The Java program mbx2featurevectors parses the training or test set and generates a file containing all of the feature vectors, represented in TiMBL's Sparse format.
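The output step can be sketched as follows. The exact Sparse syntax should be verified against the TiMBL reference guide [5]; this sketch assumes one line per document, with (index,value) pairs for the nonzero features followed by the class label.

    import java.util.*;

    // Emit one document as a line of sparse feature vectors (sketch).
    // ASSUMPTION: the line consists of "(index,value)" pairs for the
    // nonzero features followed by the class label, per my reading of the
    // TiMBL reference guide [5]; verify against the guide before use.
    static String toSparseLine(SortedMap<Integer, Double> nonzero, String bucket) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<Integer, Double> e : nonzero.entrySet()) {
            sb.append('(').append(e.getKey()).append(',')
              .append(e.getValue()).append(')');
        }
        return sb.append(' ').append(bucket).toString();
    }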
TiMBL processes the training and test data in response to a single command. It has a number of command-line options with which I experimented in an attempt to extract better accuracy (an example invocation appears after this list). Among them:

• k, the number of neighbors to consider when classifying a test point: the literature suggests that anywhere between one and a handful of neighbors may be optimal for this type of task

• w, the feature weighting scheme: the classifier attempts to learn which features have more relative importance in determining the classification of an instance; this can be absent (all features get equal weight) or based on information gain or other slight variations such as gain ratio and shared variance

• m, the distance metric: how to calculate the nearness of two points based on their features; the options that I tried included overlap (basic equals or not equals for each feature), the modified value difference metric (MVDM), and Jeffrey divergence

• d, the class vote weighting scheme for neighbors: this can be simple majority (all neighbors have equal weight) or various alternatives, such as Inverse Linear and Inverse Distance, that assign higher weight to those neighbors that are closer to the instance
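For illustration, a single trial corresponding to the best-scoring configuration in Table 4 (MVDM, shared variance, k = 3, inverse-linear voting) might be launched with a command along the following lines; the file names are placeholders, the input-format and feature-count flags for the Sparse format are omitted, and the exact flag spellings should be checked against the TiMBL reference guide [5]:

    Timbl -f train.vec -t test.vec -mM -w 4 -k 3 -d IL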
For distance metrics, MVDM and Jeffrey divergence are similar and, on this task with its numeric feature vectors, both clearly preferable to basic overlap, which draws no distinction between two values that are almost but not quite equivalent and two values that are very far apart. The other options have no clearly superior setting a priori, so I relied on the advice of the TiMBL reference guide and the results of my various trial runs.
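To make the contrast with overlap concrete: Jeffrey divergence is a symmetric measure of the difference between two probability distributions, and in value-difference metrics of this kind the distributions compared are the per-class distributions observed with each feature value, so nearly interchangeable values get a small distance. A minimal Java sketch of the standard formula (an illustration, not TiMBL's optimized implementation):

    // Jeffrey divergence between two discrete distributions p and q over
    // the same classes (standard formula; illustrative, not TiMBL's code).
    // Terms with zero mass contribute nothing, since x*log(x) -> 0.
    static double jeffreyDivergence(double[] p, double[] q) {
        double d = 0.0;
        for (int c = 0; c < p.length; c++) {
            double m = (p[c] + q[c]) / 2.0; // midpoint distribution
            if (p[c] > 0) d += p[c] * Math.log(p[c] / m);
            if (q[c] > 0) d += q[c] * Math.log(q[c] / m);
        }
        return d;
    }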
RESULTS/CONCLUSIONS

The confusion matrices for POPFile and for the most successful TiMBL run are reproduced in Tables 2 and 3. Figure 1 compares the accuracy scores of the two algorithms on each category. Table 4 lists accuracy scores for various combinations of TiMBL options. The number of TiMBL runs possible was limited considerably by the length of time that each run takes – up to several hours even on a fast machine, depending greatly on the exact options specified.

        ae  bs   c  hf  na    p  pa  se    s  ua   w  wb
  ae     3   0   0   0   0    1   0  25   14   0   0   0
  bs     0   5   0   0   0    3   0   4   19   0   0   0
  c      0   1  38   0   0   12   0   8   13   0   0   0
  hf     0   1   0   5   0   10   0   0    5   0   0   0
  na     1   1   0   0   5   11   0   0    0   0   0   0
  p      0   0   0   2   0  189   0   0   15   0   1   0
  pa     0   0   0   0   0    2  13   6    5   0   0   0
  se     0   2   0   1   0    8   0  27   29   0   0   0
  s      0   1   0   0   0   28   0   6  178   0   0   0
  ua     0   0   0   0   0    1   0   0    0   5   0   0
  w      2   0   0   0   0   41   0   0   12   0  27   0
  wb     0   0   0   0   0   18   0   0    0   0   0   0

Table 2. Confusion matrix for the best TiMBL run (rows are actual categories, columns are predicted categories)

        ae  bs   c  hf  na    p  pa  se    s  ua   w  wb
  ae    38   0   1   0   0    0   0   0    2   0   2   0
  bs     0  10   0   0   0    0   0   0   21   0   0   0
  c      8   3  51   0   0    4   1   0    2   1   0   0
  hf     0   0   0   7   0    7   1   1    4   0   0   0
  na     0   0   0   1  32    0   0   0    0   0   0   0
  p      0  10   3   8   0  140   2   7   20   0   4   4
  pa     3   1   0   0   0    0  18   0    2   0   1   0
  se     0   5   2   1   0    3   0  33   20   0   0   0
  s      0  14   3   2   0   15   0   2  173   0   0   3
  ua     0   0   0   0   0    0   0   0    0   6   0   0
  w      1   0   7   0   0    4   1   2    4   2  59   0
  wb     0   0   0   1   0    2   0   0    0   0   0  14

Table 3. Confusion matrix for POPFile (rows are actual categories, columns are predicted categories)
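Overall and per-category accuracy follow directly from these matrices; for example, the diagonal of Table 3 sums to 581 of 799 test messages, which is POPFile's 72.7% overall score. A small Java sketch of the computation, assuming the row/column layout shown above:

    // Accuracy from a confusion matrix m where m[i][j] counts messages of
    // actual category i predicted as category j (layout of Tables 2-3).
    static double overallAccuracy(int[][] m) {
        int correct = 0, total = 0;
        for (int i = 0; i < m.length; i++) {
            for (int j = 0; j < m[i].length; j++) {
                total += m[i][j];
                if (i == j) correct += m[i][j];
            }
        }
        return (double) correct / total;
    }

    // Per-category accuracy: correct predictions over the row total.
    static double categoryAccuracy(int[][] m, int i) {
        int rowTotal = 0;
        for (int j = 0; j < m[i].length; j++) rowTotal += m[i][j];
        return (double) m[i][i] / rowTotal;
    }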
As the tables and figure indicate, POPFile clearly
outperformed even the best run by TiMBL.
POPFile’s overall accuracy was 72.7%,
compared to only 61.1% for the best TiMBL
trial. In addition, POPFile’s accuracy was well
over 60% in almost all of the categories; by
contrast, the k-NN system only performed well in
three categories. Interestingly, it performed best in the two largest categories, personal and sports – in fact, in those two categories it was more accurate than POPFile.
Apparently it succeeded in distinguishing those
categories from the rest of the buckets and from
each other, but failed to pick up on most of the
other important differences across buckets.
[Figure 1. Accuracy by category: a bar chart (0%–100%) comparing TiMBL and POPFile accuracy for each of the twelve buckets; not reproduced here.]

  m        w           k   d            accuracy
  MVDM     gain ratio   9  inv. dist.   51.0%
  overlap  none         1  majority     54.9%
  overlap  inf. gain   15  inv. dist.   53.7%
  MVDM     shared var   3  inv. linear  61.1%
  Jeffrey  shared var   5  inv. linear  60.2%
  overlap  shared var   9  inv. linear  58.9%
  MVDM     gain ratio  21  inv. dist.   49.4%
  MVDM     inf. gain    7  inv. linear  57.4%
  MVDM     shared var   1  inv. dist.   61.0%
  MVDM     shared var   5  majority     54.6%

Table 4. Sample of TiMBL trials
The various TiMBL runs provide evidence for a few minor insights about how to get the most out of the k-NN algorithm. The overwhelming conclusion is that shared variance is far superior to the other weighting schemes for this task. Based on the explanation given in the TiMBL documentation, this performance disparity is likely a reflection of the ability of shared variance (and chi-squared, which is very similar) to avoid a bias toward features with more values – a significant problem with gain ratio. The results also suggest that k should be a small number – the highest values of k gave the worst results. The effect of the m and d options is unclear, though simple majority voting seems to perform worse than inverse distance and inverse linear.
It is also important to recognize the impact of the original construction of the feature vectors. Perhaps the k-NN system's poor performance was a result of unwise choices in mbx2featurevectors: focusing on the wrong headers, not parsing symbols and numbers as elegantly as possible, not trying a bigram or trigram model on the message body, choosing a poor tf.idf formula, etc.

OTHER RESEARCH

A vast amount of research already exists on this and similar topics. Some people, e.g. Rennie et al. [6], have investigated ways to overcome the faulty Naïve Bayesian assumption of conditional independence. Kiritchenko and Matwin [7] found that support vector machines are superior to Naïve Bayesian systems when much of the training data is unlabeled. Other researchers have attempted to use semantic information to improve accuracy [8].

In addition to the two models discussed in this paper, there exist many other options for text classification: support vector machines, maximum entropy and logistic models, decision trees, and neural networks, for example.

REFERENCES

[1] POPFile: http://popfile.sourceforge.net
[2] SpamAssassin: http://www.spamassassin.org
[3] TiMBL: http://ilk.kub.nl/software.html#timbl
[4] Manning, Christopher and Hinrich Schütze. Foundations of Statistical Natural Language Processing. 2000.
[5] TiMBL reference guide: http://ilk.uvt.nl/downloads/pub/papers/ilk0310.pdf
[6] Rennie, Jason D. M., Lawrence Shih, Jaime Teevan and David R. Karger. Tackling the Poor Assumptions of Naive Bayes Text Classifiers. Proceedings of the Twentieth International Conference on Machine Learning. 2003.
[7] Kiritchenko, Svetlana and Stan Matwin. Email Classification with Co-Training. Proceedings of the 2001 Conference of the Centre for Advanced Studies on Collaborative Research. 2001.
[8] Turenne, Nicolas. Learning Semantic Classes for Improving Email Classification. Biométrie et Intelligence. 2003.