International Journal of Computer Engineering and Technology (IJCET)
ISSN 0976 – 6367 (Print), ISSN 0976 – 6375 (Online)
Volume 4, Issue 1, January-February (2013), pp. 318-324
© IAEME: www.iaeme.com/ijcet.asp
Journal Impact Factor (2012): 3.9580 (Calculated by GISI)
www.jifactor.com
PREPARE BLACK LIST USING BAYESIAN APPROACH TO
IMPROVE PERFORMANCE OF SPAM FILTER
Nitin Rola¹, Prof. Rashmi Gupta²
¹ Computer Science & Engineering, TIT, Bhopal
² Computer Science & Engineering, TIT, Bhopal
ABSTRACT
Email is a secure, cheap, easy and reliable communication medium, but it has one
big disadvantage: spam (junk) email. One solution is an automatic filtering system that
eliminates unwanted (spam) mail. The Bayesian approach is efficient and powerful for this
task. It appears to be a simple text classification technique, but it remains an active
research topic because the cost of misclassifying legitimate mail as spam is very high.
Here we consider an origin-based approach and a Bayesian approach for filtering spam mail.
The major issue with the Bayesian approach is the performance of the filter when the word
library becomes very large. To improve performance, we first classify e-mail on the basis
of its origin (black list) and then classify it with the Bayesian approach, which makes the
filter both more accurate and faster.
Keywords: Automated Accurate and Faster Spam Filter, Train Origin Database by Bayesian
Approach, Self Learning.
I. INTRODUCTION
This is an era of rapid information exchange, and email is one of the most advanced,
secure, cheap, reliable and fast technologies for it. The number of email users is increasing
day by day, and so is the volume of unwanted mail (spam). Email is also a popular medium of
communication for e-commerce, which has opened the door for direct marketers to bombard
users' mailboxes with unwanted mail. Because the same copy of a mail sits in many users'
mailboxes on the same server, spam wastes both storage resources and bandwidth. Spam mail
is also called unsolicited bulk mail or junk mail; in short, spam is unwanted internet
email. Spam is an ever-increasing problem. The number of spam mails is increasing daily;
studies show that over 90% of all current email is spam. Added to this, spammers are
becoming more sophisticated and are constantly managing to outsmart 'static' methods of
fighting spam. The techniques currently used by most anti-spam software are static, meaning
that they are fairly easy to evade by tweaking the message a little. To do this, spammers
simply examine the latest anti-spam techniques and find ways to dodge them. To combat spam
effectively, a new adaptive technique is needed. This method must be familiar with
spammers' tactics as they change over time. It must also be able to adapt to the particular
organization that it is protecting from spam. The answer lies in Bayesian mathematics.
Figure 1 shows SpamCop statistics: a maximum of 34.7 spam mails sent per second and a total
of 12,666,548 spam mails sent in the last month.
Fig 1: SpamCop Statistics
For filtering, we combine two approaches, origin-based and Bayesian, to obtain both
speed and accuracy. The origin technique is fast but not accurate, whereas the Bayesian
technique is accurate but slow. We take advantage of both techniques to develop a highly
accurate and faster spam filter.
II. ORIGIN-BASED FILTER
Origin-based filters are methods that use network information to detect whether a
message is spam.[1] The IP address and the email address are the most important pieces of
network information used.[1] There are several major types of origin-based filters, such as
blacklists, whitelists, and challenge/response systems.[1] Here we use the blacklist
technique and maintain the black list by self learning: we train the black list database
from the spam mail classified by the Bayesian filter.
III. BAYESIAN APPROACH
Naive Bayesian classification is a fundamental statistical approach based on
probability, initially proposed for spam filtering by Sahami et al. (1998).[2] The Bayesian
algorithm predicts the classification of a new e-mail by identifying it as spam or
legitimate.[2] This is achieved by looking at the features using a 'training set' that has
already been correctly pre-classified and then checking whether a particular word appears
in the e-mail. A high probability indicates that the new e-mail is spam.[2]
A Bayesian classifier is simply a Bayesian network applied to a classification task.[2] It
contains a node C representing the class variable (junk or legitimate) and a node Xi for each of the
features (each of the words). Given a specific instance x (an assignment of values x1, x2, x3, ..., xn to
the feature variables), the Bayesian network allows us to compute the probability P(C = ck | X = x) for each
possible class ck. This is done via Bayes' theorem, giving us
Bayes:
$$P(C = c_k \mid X = x) = \frac{P(X = x \mid C = c_k)\, P(C = c_k)}{P(X = x)}$$
In the context of classification, specifically junk email filtering, it becomes necessary to
represent mail messages as feature vectors so that such Bayesian classification methods are directly
applicable.
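As a small illustration of Bayes' theorem in this setting, the following Python sketch computes the posterior probability of each class for a single observed feature (for example, the presence of one word). The function name `posterior` and all numbers are illustrative assumptions, not values from this paper.

```python
def posterior(prior, likelihood):
    """Return P(C = c_k | X = x) for every class c_k.

    prior[c]      : P(C = c)
    likelihood[c] : P(X = x | C = c)
    """
    # P(X = x) is the sum of the numerators over all classes.
    evidence = sum(prior[c] * likelihood[c] for c in prior)
    return {c: prior[c] * likelihood[c] / evidence for c in prior}


# Hypothetical numbers: half of all mail is junk, and the observed word
# appears in 40% of junk mail but only 10% of legitimate mail.
prior = {"junk": 0.5, "legitimate": 0.5}
likelihood = {"junk": 0.4, "legitimate": 0.1}
result = posterior(prior, likelihood)
```

With these assumed inputs the junk class receives most of the posterior mass, and the two posteriors sum to one because they share the same denominator P(X = x).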
IV. ACTUAL IMPLEMENTATION
We divided this implementation into the following two parts.
A. Training
B. Classification
A. Training
In the training part we have to train the following three databases of the spam filter.
• Origin Email id with counter (Blacklist).
• Spam with counter.
• Legitimate with counter.
For our system we used some mails from the following e-mail IDs to train the databases.
• enr.nitinrola@gmail.com
• aakash.siddhpura@yahoo.co.in
• rohit.it409@gmail.com
In this algorithm we ignore some commonly occurring words. The list of these words is
given below:
hi, hello, dear, regards, thank, thanks, of, into, they, she, it, been, he, in, the, how, where, an,
out, you, i, am, there, not, can, could, would, will, if, has, have, why, who, had, with, your, or,
any, my, we, so, date, to, from, mon, monday, tue, tuesday, wed, wednesday, thu, thursday, fri,
friday, sat, saturday, sun, sunday, jan, feb, mar, apr, may, jun, jul, aug, sep, oct, nov, dec, let,
make, put, seem, take, about, among, at, between, now, still, almost, even, much, quite,
very, please.
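A minimal sketch of how such a word list might be applied during feature extraction is given below. The tokenization rule (lowercase runs of letters, digits and apostrophes), the function name `extract_features` and the shortened `STOP_WORDS` set are assumptions made for illustration; the paper does not specify how words are split.

```python
import re
from collections import Counter

# Abbreviated, illustrative subset of the ignored-word list above.
STOP_WORDS = {
    "hi", "hello", "dear", "regards", "thank", "thanks", "of", "the",
    "you", "i", "am", "to", "from", "please", "very",
}


def extract_features(text):
    """Return a Counter of lowercase words in text, skipping ignored words."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(w for w in words if w not in STOP_WORDS)


features = extract_features(
    "Dear friend, thanks! Win money, win prizes. "
    "Please reply to claim the money."
)
```

The resulting counter holds each remaining word with its frequency in the message, which is the form needed later when word frequencies are stored per mail.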
A.1 Training (Algorithm)
1. After classification, retrieve the sender email id of every spam mail.
2. If the sender email id of a spam mail is already in the origin (blacklist) database,
increase its count; otherwise insert the email id into the origin (blacklist) database.
3. Retrieve the sender email id of every legitimate mail.
4. If the sender email id of a legitimate mail is in the origin (blacklist) database, set
its count to zero.
5. Extract features (words) from all spam mail.
6. Update the spam database: if a word is already present, increase its count by one;
otherwise insert it as a new word with count one.
7. Update the legitimate database in the same way with the words of all legitimate mail:
if a word is already present, increase its count by one; otherwise insert it as a new
word with count one.
8. Database improvement is complete.
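The training steps above can be sketched in Python as follows. This is only an illustration under stated assumptions: the three databases are modelled as in-memory `collections.Counter` objects, a newly inserted sender starts with count one, word counts grow once per occurrence, and the function name `train` and the example addresses are hypothetical.

```python
from collections import Counter


def train(spam_mails, legit_mails, blacklist, spam_words, legit_words):
    """Update the three counters in place.

    Each mail is a (sender, words) pair, where words is a list of
    already extracted features.
    """
    for sender, words in spam_mails:
        blacklist[sender] += 1       # steps 1-2: insert or increment
        spam_words.update(words)     # steps 5-6
    for sender, words in legit_mails:
        if sender in blacklist:
            blacklist[sender] = 0    # steps 3-4: reset a listed sender
        legit_words.update(words)    # step 7


blacklist, spam_words, legit_words = Counter(), Counter(), Counter()
train(
    spam_mails=[("promo@example.com", ["win", "money"]),
                ("promo@example.com", ["win", "prize"])],
    legit_mails=[("friend@example.com", ["meeting", "report"])],
    blacklist=blacklist,
    spam_words=spam_words,
    legit_words=legit_words,
)
```

Resetting the count to zero instead of deleting the entry follows step 4 literally; a sender that later sends spam again simply starts counting from zero.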
A.2 Training (Flow Chart)
Retrieve sender email id of all spam mail
  If sender email id is available in origin database:
    Yes: Increase counter of this email id in origin database
    No:  Insert as a new entry in origin database
Retrieve sender email id of all legitimate mail
  If sender email id of legitimate mail is available in origin database:
    Yes: Set counter value as zero
    No:  Insert as a new entry in origin
Retrieve word of all legitimate mail
  If word is available in legitimate database:
    Yes: Increase counter value by 1
    No:  Insert as a new word
Retrieve word of all spam mail
  If word is available in spam database:
    Yes: Increase counter value by 1
    No:  Insert as a new word
Training process complete
A.3 Classification Process (Algorithm)
1. Download new mail.
2. Retrieve the origin (sender email id).
3. If there is no sender id, classify the mail as spam.
4. If the sender email id is in the origin database, check its count: if the count is
greater than 20, classify the mail as spam; otherwise send the mail to the second
level (Bayesian) for classification. A mail whose sender is not in the origin
database also goes to the second level.
5. The second level (Bayesian) receives every mail that was not classified by the first
level (origin).
6. Extract features (words) from the mail and store them in a temporary database
together with their frequency of occurrence in that mail.
7. If there is no text in the mail, classify it as spam.
8. If there is an attachment, display a message asking the user to check the mail,
because the filter is not able to read attachments.
9. Calculate the spam and legitimate probabilities of each word using the Bayesian
formula above.
10. Store the spam and legitimate probabilities of each word in the temporary database.
11. Sum the probabilities of all words of the same mail, separately for spam and legitimate.
12. If the sum of probabilities for spam is greater than that for legitimate, classify the
mail as spam; otherwise classify it as legitimate.
13. If the sums for spam and legitimate are equal, classify the mail as legitimate.
14. Classification process is complete.
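The two-level procedure above can be sketched as follows. The paper does not state how the per-word probabilities are estimated, so this sketch assumes equal class priors, estimates P(word | class) as the word's share of all word counts for that class, and lets words unseen in training contribute nothing. The names `classify`, `word_posteriors` and `BLACKLIST_THRESHOLD` are hypothetical, and attachment handling (step 8) is omitted.

```python
from collections import Counter

BLACKLIST_THRESHOLD = 20  # a sender with a higher count is treated as a spammer


def word_posteriors(word, spam_words, legit_words):
    """Return (P(spam | word), P(legit | word)), assuming equal class priors."""
    spam_total = sum(spam_words.values()) or 1
    legit_total = sum(legit_words.values()) or 1
    p_w_spam = spam_words.get(word, 0) / spam_total
    p_w_legit = legit_words.get(word, 0) / legit_total
    evidence = p_w_spam + p_w_legit
    if evidence == 0:  # word never seen during training
        return 0.0, 0.0
    return p_w_spam / evidence, p_w_legit / evidence


def classify(sender, words, blacklist, spam_words, legit_words):
    # Level 1 (origin): missing sender or heavily blacklisted sender.
    if not sender or blacklist.get(sender, 0) > BLACKLIST_THRESHOLD:
        return "spam"
    # Level 2 (Bayesian): a mail without text is treated as spam (step 7).
    if not words:
        return "spam"
    spam_sum = legit_sum = 0.0
    for w in words:
        p_spam, p_legit = word_posteriors(w, spam_words, legit_words)
        spam_sum += p_spam
        legit_sum += p_legit
    # Ties are resolved in favour of legitimate (step 13).
    return "spam" if spam_sum > legit_sum else "legitimate"


blacklist = Counter({"bulk@example.com": 25})
spam_words = Counter({"win": 8, "money": 6, "offer": 6})
legit_words = Counter({"meeting": 10, "report": 8, "money": 2})

first = classify("bulk@example.com", ["meeting"], blacklist, spam_words, legit_words)
second = classify("new@example.com", ["win", "money"], blacklist, spam_words, legit_words)
third = classify("friend@example.com", ["meeting", "report"], blacklist, spam_words, legit_words)
```

In this example the first mail is rejected at the origin level without inspecting its words, which is the saving the two-level design aims for; the other two mails reach the Bayesian level.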
A.4 Classification Process (Flow Chart)
New Mail
Retrieve Sender ID
If sender ID is available in Origin Database and count > 20:
  Yes: Classify as a Spam
  No:  Extract features (word)
       Calculate probabilities in Spam
       If Spam_Prob > Legit_Prob:
         Yes: Classify as a Spam
         No:  Classify as a Legitimate
Update Database for Self Learning
V. RESULTS
TABLE 1
Total Mail = 28
Level      Spam   Legitimate   Actual Spam   Actual Legitimate
Origin     5      23           23            5
Bayesian   17     6            18            5
TABLE 2
Total Mail = 17
Level      Spam   Legitimate   Actual Spam   Actual Legitimate
Origin     6      11           13            4
Bayesian   9      4            9             4
In Table 1, 5 of the 28 mails are classified as spam at the origin level, so the second
level only has to check the content of the 23 mails that were not classified as spam at the
origin level.
In Table 2, 6 of the 17 mails are classified as spam at the origin level, so the second
level only has to check the content of the 11 mails that were not classified as spam at the
origin level.
The origin level alone cannot be accurate: if a spam mail arrives from a different email
id that is not yet blacklisted, it is classified as legitimate. We therefore use the
Bayesian approach at the second level to improve accuracy, giving it as input all mails
classified as legitimate by the origin level. If we did not use the origin level, the
Bayesian filter would have to check the contents of all mails, which would degrade the
performance of the filter.
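The workload reduction described above can be checked with a few lines of arithmetic. The sketch below uses only the totals and origin-level spam counts from Tables 1 and 2; the function name `origin_share` is an illustrative assumption.

```python
def origin_share(total_mails, caught_at_origin):
    """Return (mails passed to level 2, fraction resolved at level 1)."""
    return total_mails - caught_at_origin, caught_at_origin / total_mails


table1 = origin_share(28, 5)  # Table 1: 5 of 28 mails stopped at origin level
table2 = origin_share(17, 6)  # Table 2: 6 of 17 mails stopped at origin level
```

For these two test sets, roughly 18% and 35% of the mails never reach the Bayesian level.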
VI. CONCLUSION
At a time when junk email is a growing problem, we have built a system that
classifies junk mail automatically; the system uses the concept of origin together with
Bayes' theorem for the classification task. The efficiency of this kind of system can be
enhanced by considering not only the words of a mail as features but also other
domain-specific features that provide strong evidence of junk. We can also add some
hand-made rules to the system to improve its performance. Here we have not considered the
header of the mail, so in future work the header can be used to improve the accuracy of the
system.
REFERENCES
Journal Papers:
[1] Thamarai Subramaniam, Hamid A. Jalab and Alaa Y. Taqa, Overview of textual anti-spam
filtering techniques, International Journal of the Physical Sciences, Vol. 5(12),
pp. 1869-1882, 4 October 2010
[2] Alia Taha Sabri, Adel Hamdan Mohammads, Bassam Al-Shargabi and Maher Abu
Hamdeh, Developing New Continuous Learning Approach for Spam Detection using
Artificial Neural Network (CLA_ANN), European Journal of Scientific Research, ISSN
1450-216X, Vol. 42, No. 3 (2010), pp. 525-535
[3] Ahmed Khorsi, An Overview of Content-Based Spam Filtering Techniques,
Informatica 31 (2007), pp. 269-277
[4] Giorgio Fumera, Ignazio Pillai and Fabio Roli, Spam Filtering Based On The Analysis Of
Text Information Embedded Into Images, Journal of Machine Learning Research 7 (2006),
pp. 2699-2720
[5] Ms. Jyoti Pruthi and Dr. Ela Kumar, "Data Set Selection In Anti-Spamming Algorithm -
Large Or Small", International Journal of Computer Engineering and Technology
(IJCET), Volume 3, Issue 2, 2012, pp. 206-212. Published by IAEME.
[6] C.R. Cyril Anthoni and Dr. A. Christy, "Integration Of Feature Sets With Machine
Learning Techniques For Spam Filtering", International Journal of Computer Engineering
and Technology (IJCET), Volume 2, Issue 1, 2011, pp. 47-52. Published by IAEME.
Theses:
[7] Jon Kagstrom, Improving Naive Bayesian Spam Filtering, Mid Sweden University
Department for Information Technology and Media Spring 2005
[8] Thomas Richard Lynam, Spam Filter Improvement Through Measurement, Waterloo,
Ontario, Canada, 2009
[9] Csaba Gulyas, Creation of a Bayesian network-based meta spam filter, using the analysis
of different spam filters, Budapest, 16th May 2006
Proceedings Papers:
[10] Vikas P. Deshpande, Robert F. Erbacher, and Chris Harris, An Evaluation of Naïve
Bayesian Anti-Spam Filtering Techniques, Proceedings of the 2007 IEEE Workshop on
Information Assurance United States Military Academy, West Point, NY 20-22 June
2007
[11] Yanhui Guo, Yaolong Zhang, Jianyi Liu and Cong Wang, Research on the
Comprehensive Anti-Spam Filter, 9701-0/06/$20.00 ©2006 IEEE.
[12] Xi-Lin Zhao, Jian-Zhong Zhou, Bo Fu and Hui Lui, Research of Probability Petri Nets Model
For Fault Diagnosis Based on Bayesian theorem, Proceedings of the 7th World Congress
on Intelligent Control and Automation, June 25 - 27, 2008, Chongqing, China
[13] Biju Issac, Wendy Japutra Jap and Jofry Hadi Sutanto, Improved Bayesian Anti-Spam
Filter Implementation and Analysis on Independent Spam Corpuses, 2009 International
Conference on Computer Engineering and Technology
[14] Chengcheng Li and Jianyi Liu, Combining Behavior And Bayesian Chinese Spam Filter,
Proceedings of IC-NIDC2009
[15] Yishan Gong and Qiang Chen, Research of Spam Filtering Based on Bayesian
Algorithm, 2010 International Conference on Computer Application and System
Modeling (ICCASM 2010)