SlideShare ist ein Scribd-Unternehmen logo
1 von 8
Downloaden Sie, um offline zu lesen
International Journal of Computer Science & Information Technology (IJCSIT) Vol 6, No 1, February 2014
DOI : 10.5121/ijcsit.2014.6111 147
Processing Obtained Email Data By Using Naïve
Bayes Learning Algorithm
Valery V. Pelenko, Aleksandr V. Baranenko
St. Petersburg Research University of Information Technologies, Mechanics and Optics.
Institute of Refrigeration and Biotechnology
9 Lomonosova St., St Petersburg, 191002, Russia
ABSTRACT
This paper gives a basic idea how various machine learning techniques may be applied towards processing
the data from DEA services to find out whether people use these services for legitimate or non-legitimate
purposes.
KEYWORDS
Spam, ham, emails, communications, machine learning, Naïve Bayes, DEA services.
1. INTRODUCTION
In this paper we are characterizing a set of found email data using a method of supervised
learning – Naïve Bayes. There are lots of DEA services, but we decided to get the data from most
commonly used DEA services [1]. The data set was chosen from five independent disposable
email address (DEA) services: dispostable, mailinator, mytrashmail, staging and tempemail. The
DEA allow users to receive the emails without creating an account. Not much is known how
people use these services; in fact, there are so many legitimate and non-legitimate purposes that
people use these services for. Since the accounts don’t require the password to access them, this
data becomes public. Some users may not realize how public this data are and that anyone else
can purposely or accidentally access this private information. We decided to go ahead and
categorize the found email data and split it up in four main categories. Our goal is to find out how
much of the information that is stored in the DEA services is legitimate. At the same time, the
main purpose for this paper was to break down the data usage and present the statistics of the
DEA services.
We found that about 72% of the data was tagged ‘spam’ comparing to only 6.7% of legitimate
data which was tagged ‘ham’. The remaining part was split unevenly between 6.2% of ‘other’ and
15.1% of ‘Non English’. We used Naïve Bayes to split the data into the categories mentioned
above.
During thus project, we were dealing with the data that was obtained from five various
independent DEA services. DEA services are such services that do not require a user to create an
account in order to receive an email to any email address supported by the service. Thus, all the
data becomes public due to no identification required. Since the DEA services are used by many
users from all over the world for a variety of purposes, then the data that was obtained such
services is obviously to be very diverse. It is very important to understand that users may use
DEA services for both legitimate and non-legitimate purposes [2]. Since all the data is public and
International Journal of Computer Science & Information Technology (IJCSIT) Vol 6, No 1, February 2014
148
some users use the DEA services for legitimate purposes, these users may not realize how public
the data are and what risk they take. This is the main factor why DEA services may not be very
helpful as they seem to be at first. In order to protect personal information and avoid an identity
theft, the users should be informed beforehand what risk they take. Some non-legitimate users
may find this personal data such as: name, address, SSN, credit card number, and others very
useful and the victim can suffer adverse consequences if they are held accountable for the
perpetrator's actions.
The main purpose of the paper was to characterize this found email data. Once we are done
splitting the data into different categorize, we can look at the informative data. The informative
data would be categorized as ‘ham’. A closer look at the messages in the ‘ham’ category will
benefit us in some ways such are presented below:
• Understand what people use DEA services for legitimate purposes.
• What did force these users use DEA service over the standard email account?
• Is there any risk that people take when using DEA services for legitimate purposes, if
there is, how high is that risk?
After processing all the data, We will be able to demonstrate how much of this data are used for
legitimate purposes, non-legitimate purposes and any other purposes (if any); show whether the
users take any risk of loss of any personal information when using DEA services
2. FOUND EMAIL DATA SETS
Disposable email addressing (DEA) is a way of sharing and managing email addressing [3]. DEA
allows user to set up a new, unique email address for every contact or use an already existing
email address that only requires having a username to access the existing account in order to
make a connection between the sender and the recipient.
Our main idea is that the DEA services are mostly used by people who try to avoid using their
personal email accounts due to various factors including receiving spam [3,4]. Thus, these people
prefer to use DEA services to stay anonymous.
The main advantage of using the DEA services is that there is no need to register by using your
real credentials and, of course, these services are free of charge. The user is given a choice to
select any name for the email address and use it, even if it was used before [5].
As any other service, there are also disadvantages of using the DEA services. If this account has
been used before, then the user will be able to track all the messages in this account and some
private information may become public. Many forums and legitimate services filter out messages
sent from DEA domains.
3. METHODOLOGY
The primary work for this paper was to characterize the found email data from various email
services listed below:
• Dispostable
• Mailinator
• Mytrashmail
• Staging
• Tempemail
International Journal of Computer Science & Information Technology (IJCSIT) Vol 6, No 1, February 2014
149
We have competed two separate stages of characterizing the data. We were mainly interested in
breaking down the category of “ham” email versus “spam” category. The categories used at the
first and second stages are listed below:
Stage 1:
• Spam
The messages considered to be under the category of spam if it’s obvious that the
recipient has no pre-existing relations with the sender. For example, any message that
looks like an advertisement and it was sent to thousands of people simultaneously, this
message will be considered as spam [4].
• Ham
The messages considered to be under the category of ham if the user seems to have pre-
existing relation such as the message seems to indicate a specific action taken by the
person to cause it (an invoice for purchased products, or a response mentioning that a
purchase could not go through etc.) will be considered ham [4].
• Non English
The messages considered to be non English if the original language in what the message
was written is other than English, e.g., Russian, Spanish, Italian, Chinese, Arabic,
Hebrew, Ukrainian (the listed languages were found in the found email data set).
• Other
The messages considered to be under the category of other if the messages couldn’t be
assigned to any other category mentioned above. For example, an email with no text in it
and a plain black background would be described as other.
• Errors
The messages considered to be under the category of other if the messages couldn’t be
parsed by program or broken files.
Stage 2:
• Buying
The messages considered to be under the category of buying if the email that was
received by the recipient mentions that he or she has purchased something. For example,
an email received from paypal service would be considered as buying email.
• Signing Up
The messages considered to be under the category of signing up if the email that was
received by the percipient shows that the user has subscribed to some service. For
example, an email received from the forum web sites is considered as signing up email.
International Journal of Computer Science & Information Technology (IJCSIT) Vol 6, No 1, February 2014
150
• Non Ham (Spam)
The definition provided above in stage 1.
• Personal
The messages considered to be under the category of personal if the user has requested a
subscription from the social websites. For example, an email received from Linkedin is
considered as a personal email.
The found email data, as mention above, was taken from five independent DEA services.
Please refer to the table to see what the size of the data was from each service.
DEA Service Size of the data (GB) % of data
Dispostable 2.1 24.21%
Mailinator 0.009 0.010%
Mytrashmail 0.22 0.253%
Staging 0.095 1.095%
Tempemail 6.25 74.432%
Overall 8.674 100%
There are two main sets of machine learning techniques that can be used for classifying the data:
supervised and unsupervised learning [6]. The supervised learning is when the data is tagged
before the algorithm makes any further decisions. On the other hand, if there is no input to the
algorithm, then unsupervised learning has to be used. Algorithms in the group of unsupervised
learning find the similarities and/or correlations in the data and require no input to classify the
data.
There was a choice of using either supervised or unsupervised learning. Since the data set of
found email data was 8.67 GB, WE decided to use supervised learning. Naïve Bayes method was
chosen to be used as a method for supervised learning [6].
For the purposes of the describing the data, we decided to use the Naïve Bayes algorithm which is
suitable for handling big sets of data, in our example the data set is 8.764 GB and diverse data.
A Naïve Bayes classifier is a simple probabilistic classifier based on applying Bayes' theorem
with naive independence assumptions. In simple terms, a naive Bayes classifier assumes that the
presence (or absence) of a particular feature of a class is unrelated to the presence (or absence) of
any other feature. For example, a fruit may be considered to be an apple if it is red, round, and
about 4" in diameter. Even if these features depend on each other or upon the existence of the
other features, a naive Bayes classifier considers all of these properties to independently
contribute to the probability that this fruit is an apple.
Depending on the precise nature of the probability model, Naïve Bayes classifiers can be trained
very efficiently in a supervised learning setting [6]. In many practical applications, parameter
estimation for Naïve Bayes models uses the method of maximum likelihood; in other words, one
can work with the Naïve Bayes model without believing in Bayesian probability or using any
Bayesian methods [7].
International Journal of Computer Science & Information Technology (IJCSIT) Vol 6, No 1, February 2014
Abstractly, the probability model for a classifier is a conditional model:
dependent class variable C with a small number of outcomes or classes, conditional on several
feature variables 1F through nF . The problem is that if the number of features
a feature can take on a large number of values, then basing such a model on probability tables is
infeasible [6]. We therefore reformulate the model to make it more tractable.
Using Bayes' theorem, this can be written as following:
,...,(
,...,()(
),...,|(
1
1
1 n
Fp
FpCp
FFCP =
For the purpose of understanding, in simple English the equation provided above can be written
as following:
evidence
likelihoodprior
posterior
*
=
4. RESULTS
Due to two factors such as diversity of the data and the size of the data, the algorithm was run
twice in order to split the data. At the first stage the following tags were introduced into the
model. The description of each tag is provided in the introduction section. The size of data
considered at the first stage was 8.76 GB.
1. Spam
2. Ham
3. Non English
4. Other
5. Errors
After training the algorithm, the size of the database which contains the words and its
probabilities was 1878 KB (234 messages). The results obtained after the first stage are as
following:
83726
0
50000
100000
150000
200000
250000
300000
350000
400000
Non English
International Journal of Computer Science & Information Technology (IJCSIT) Vol 6, No 1, February 2014
Abstractly, the probability model for a classifier is a conditional model: ,...,|( 1FCP
with a small number of outcomes or classes, conditional on several
. The problem is that if the number of features n is large or when
mber of values, then basing such a model on probability tables is
therefore reformulate the model to make it more tractable.
Using Bayes' theorem, this can be written as following:
),...,
)|,...,
n
n
F
CF
understanding, in simple English the equation provided above can be written
likelihood
Due to two factors such as diversity of the data and the size of the data, the algorithm was run
data. At the first stage the following tags were introduced into the
model. The description of each tag is provided in the introduction section. The size of data
considered at the first stage was 8.76 GB.
the algorithm, the size of the database which contains the words and its
probabilities was 1878 KB (234 messages). The results obtained after the first stage are as
77090
33559
347191
11418
Ham Other Spam Errors
# of messages
Stage 1 - Results
International Journal of Computer Science & Information Technology (IJCSIT) Vol 6, No 1, February 2014
151
),..., nF over a
with a small number of outcomes or classes, conditional on several
is large or when
mber of values, then basing such a model on probability tables is
understanding, in simple English the equation provided above can be written
Due to two factors such as diversity of the data and the size of the data, the algorithm was run
data. At the first stage the following tags were introduced into the
model. The description of each tag is provided in the introduction section. The size of data
the algorithm, the size of the database which contains the words and its
probabilities was 1878 KB (234 messages). The results obtained after the first stage are as
International Journal of Computer Science & Information Technology (IJCSIT) Vol 6, No 1, February 2014
Since the main idea was to break down the category of ham email, the algorithm at the second
stage filtered out the emails that were tagged as ‘Ham’ after completing the first stage. In order to
improve the accuracy of the algorithm,
at the second stage. The list of the tags is introduced below (the description of the tags in
provided in the introduction section). The size of data considered at the second stage was 1.24
GB.
1. Buying
2. Signing Up
3. Non Ham (Spam)
4. Personal
After training the algorithm, the size of the database which contains the words and its
probabilities was 1620 KB (202 messages). The results obtained after the first stage are
as following:
For the purposes of easier understanding,
category break-down:
0 10000
Buying
Signing Up
Non Ham (Spam)
Personal
68.31%
7.12%
International Journal of Computer Science & Information Technology (IJCSIT) Vol 6, No 1, February 2014
Since the main idea was to break down the category of ham email, the algorithm at the second
stage filtered out the emails that were tagged as ‘Ham’ after completing the first stage. In order to
improve the accuracy of the algorithm, we had to include the ‘Spam’ tag into the list of tags used
at the second stage. The list of the tags is introduced below (the description of the tags in
provided in the introduction section). The size of data considered at the second stage was 1.24
After training the algorithm, the size of the database which contains the words and its
probabilities was 1620 KB (202 messages). The results obtained after the first stage are
For the purposes of easier understanding, the graph below shows percentage values of the
10000 20000 30000 40000 50000
Stage 2 - Results
# of messages
5.36%
19.21%
7.12%
Buying
Signing Up
Non Ham (Spam)
Personal
International Journal of Computer Science & Information Technology (IJCSIT) Vol 6, No 1, February 2014
152
Since the main idea was to break down the category of ham email, the algorithm at the second
stage filtered out the emails that were tagged as ‘Ham’ after completing the first stage. In order to
‘Spam’ tag into the list of tags used
at the second stage. The list of the tags is introduced below (the description of the tags in
provided in the introduction section). The size of data considered at the second stage was 1.24
After training the algorithm, the size of the database which contains the words and its
probabilities was 1620 KB (202 messages). The results obtained after the first stage are
the graph below shows percentage values of the
Non Ham (Spam)
International Journal of Computer Science & Information Technology (IJCSIT) Vol 6, No 1, February 2014
153
Due to the big amount of data, the algorithm accuracy was computer using 50 messages as
following:
%100*
50
# sagesMatchedMes
Accuracy = ; the message is considered to be matched if the tag
assigned by human is the same as assigned by the Naïve Bayes Classifier.
%82%100*
50
41
==Accuracy
5. CONCLUSION
After completing all the stages of the algorithm it’s obvious that mostly the DEA services are
used for ‘spam’, which is more than 70%. The minority of usage belongs to the ‘ham’ category;
it’s only 6.7% of all the data. The other small usage goes under the category ‘other’; it’s only
6.2%. It’s important to note that a little less 1/5 of the data is used by foreigners (i.e. in language
other than English). The detailed breakdown is provided below:
Category # of messages % of messages
Non English 83726 15.13%
Ham 36746 6.66%
Other 33559 6.07%
Spam 387535 70.1%
Errors 11418 2.04%
The main purpose of the paper was to classify and breakdown the ‘ham’ category of emails, but it
couldn’t be done without completing two stages of the algorithm and split out the ‘spam’ data at
the second stage. Mostly the ‘ham’ emails are used for signing up for different services and the
percentage of this data is over 60% (4.11% of the whole data set). The second largest subcategory
which is being used in the ‘ham’ category is ‘personal’; its percentage is 22.5% (1.53% of the
whole data set). The least used subcategory of the ‘ham’ category belongs to ‘buying’; its
percentage is roughly around 17% (1.14% of the whole data set). Please refer to the table below
for the breakdown of the ‘ham’ category.
Category # of messages % of messages
Buying 3163 16.9%
Signing Up 11345 60.62%
Personal 4207 22.48%
Please refer to the graphical illustration of the ham category below:
International Journal of Computer Science & Information Technology (IJCSIT) Vol 6, No 1, February 2014
The future work mainly consists of considering the messages that are considered to be in the
‘ham’ category. A closer look at those messages will provide an image of what are the main
purposes and premises of preferring using the DEA services over using st
for legitimate purposes. Answering the question “Can human readable filter be found among this
data to look at” will be answered by manual processing of the split data.
REFERENCES
[1] Smirnov Arthur Describing how the data obtained
Economics and Environmental Management
[2] Seigneur, J-M., and Christian Damsgaard Jensen. "Privacy recovery with disposable email addresses."
Security & Privacy, IEEE 1.6 (2003): 35
[3] Reed, Micheal G., Paul F. Syverson, and David M. Goldschlag. "Anonymous connections and onion
routing." Selected Areas in Communications, IEEE Journal on 16.4 (1998): 482
[4] Bradley, David (2009-05-13). "Spam or Ham?" Sciencetext. 2011
[5] Shields, Clay, and Brian Neil Levine. "A protocol for anonymous communication over the Internet."
Proceedings of the 7th ACM conference on Computer and communications security. ACM, 2000.
[6] Smirnov, Arthur. “Artificial Intelligence: Concepts and Applicable Uses.” Lam
Publishing (2013).
[7] Schneider, Karl-Michael. "A comparison of event models for Naive Bayes anti
Proceedings of the tenth conference on European chapter of the Association for Computational
Linguistics-Volume 1. Association for Computational Linguistics, 2003.
Ham Category Break
International Journal of Computer Science & Information Technology (IJCSIT) Vol 6, No 1, February 2014
The future work mainly consists of considering the messages that are considered to be in the
‘ham’ category. A closer look at those messages will provide an image of what are the main
purposes and premises of preferring using the DEA services over using standard email accounts
for legitimate purposes. Answering the question “Can human readable filter be found among this
data to look at” will be answered by manual processing of the split data.
Smirnov Arthur Describing how the data obtained from DEA services is being used nowadays//
Economics and Environmental Management. – 2014
M., and Christian Damsgaard Jensen. "Privacy recovery with disposable email addresses."
Security & Privacy, IEEE 1.6 (2003): 35-39
heal G., Paul F. Syverson, and David M. Goldschlag. "Anonymous connections and onion
routing." Selected Areas in Communications, IEEE Journal on 16.4 (1998): 482-494
13). "Spam or Ham?" Sciencetext. 2011-09-28
, and Brian Neil Levine. "A protocol for anonymous communication over the Internet."
Proceedings of the 7th ACM conference on Computer and communications security. ACM, 2000.
Smirnov, Arthur. “Artificial Intelligence: Concepts and Applicable Uses.” Lambert Academic
Michael. "A comparison of event models for Naive Bayes anti-spam e-
Proceedings of the tenth conference on European chapter of the Association for Computational
ciation for Computational Linguistics, 2003.
Ham Category Break-down
Buying
Signing Up
Personal
International Journal of Computer Science & Information Technology (IJCSIT) Vol 6, No 1, February 2014
154
The future work mainly consists of considering the messages that are considered to be in the
‘ham’ category. A closer look at those messages will provide an image of what are the main
andard email accounts
for legitimate purposes. Answering the question “Can human readable filter be found among this
from DEA services is being used nowadays//
M., and Christian Damsgaard Jensen. "Privacy recovery with disposable email addresses."
heal G., Paul F. Syverson, and David M. Goldschlag. "Anonymous connections and onion
, and Brian Neil Levine. "A protocol for anonymous communication over the Internet."
Proceedings of the 7th ACM conference on Computer and communications security. ACM, 2000.
bert Academic
-mail filtering."
Proceedings of the tenth conference on European chapter of the Association for Computational
Signing Up

Weitere ähnliche Inhalte

Was ist angesagt?

Thematic and self learning method for
Thematic and self learning method forThematic and self learning method for
Thematic and self learning method forijcsa
 
Spamming and Spam Filtering
Spamming and Spam FilteringSpamming and Spam Filtering
Spamming and Spam FilteringiNazneen
 
E Mail & Spam Presentation
E Mail & Spam PresentationE Mail & Spam Presentation
E Mail & Spam Presentationnewsan2001
 
AN EMPIRICAL ANALYSIS OF EMAIL FORENSICS TOOLS
AN EMPIRICAL ANALYSIS OF EMAIL FORENSICS TOOLSAN EMPIRICAL ANALYSIS OF EMAIL FORENSICS TOOLS
AN EMPIRICAL ANALYSIS OF EMAIL FORENSICS TOOLSIJNSA Journal
 
Machine Learning Project - Email Spam Filtering using Enron Dataset
Machine Learning Project - Email Spam Filtering using Enron DatasetMachine Learning Project - Email Spam Filtering using Enron Dataset
Machine Learning Project - Email Spam Filtering using Enron DatasetAman Singhla
 
Naming Disambiguation in Authors Database
Naming Disambiguation in Authors DatabaseNaming Disambiguation in Authors Database
Naming Disambiguation in Authors DatabaseMohammed Alsayyari
 
Tutorial 06 - Real-Time Communication on the Internet
Tutorial 06 - Real-Time Communication on the InternetTutorial 06 - Real-Time Communication on the Internet
Tutorial 06 - Real-Time Communication on the Internetdpd
 
Web 2.0: Making Email a Useful Web App
Web 2.0: Making Email a Useful Web AppWeb 2.0: Making Email a Useful Web App
Web 2.0: Making Email a Useful Web AppAndy Denmark
 
Electronic Mail Security (University of Jeddah, Saudi Arabia)
Electronic Mail Security (University of Jeddah, Saudi Arabia)Electronic Mail Security (University of Jeddah, Saudi Arabia)
Electronic Mail Security (University of Jeddah, Saudi Arabia)IJCSIS Research Publications
 
E-Mail Security Using Spam Mail Detection and Filtering System
E-Mail Security Using Spam Mail Detection and Filtering SystemE-Mail Security Using Spam Mail Detection and Filtering System
E-Mail Security Using Spam Mail Detection and Filtering Systemrahulmonikasharma
 
A multi layer architecture for spam-detection system
A multi layer architecture for spam-detection systemA multi layer architecture for spam-detection system
A multi layer architecture for spam-detection systemcsandit
 
Can domain intelligence help healthcare service providers combat data breaches
Can domain intelligence help healthcare service providers combat data breachesCan domain intelligence help healthcare service providers combat data breaches
Can domain intelligence help healthcare service providers combat data breachesWhoisXML API
 
A novel hybrid approach of SVM combined with NLP and probabilistic neural ne...
A novel hybrid approach of SVM combined with  NLP and probabilistic neural ne...A novel hybrid approach of SVM combined with  NLP and probabilistic neural ne...
A novel hybrid approach of SVM combined with NLP and probabilistic neural ne...IJECEIAES
 
Overview of Anti-spam filtering Techniques
Overview of Anti-spam filtering TechniquesOverview of Anti-spam filtering Techniques
Overview of Anti-spam filtering TechniquesIRJET Journal
 
Spam Detection in Social Networks Using Correlation Based Feature Subset Sele...
Spam Detection in Social Networks Using Correlation Based Feature Subset Sele...Spam Detection in Social Networks Using Correlation Based Feature Subset Sele...
Spam Detection in Social Networks Using Correlation Based Feature Subset Sele...Editor IJCATR
 

Was ist angesagt? (19)

Thematic and self learning method for
Thematic and self learning method forThematic and self learning method for
Thematic and self learning method for
 
Spamming and Spam Filtering
Spamming and Spam FilteringSpamming and Spam Filtering
Spamming and Spam Filtering
 
E Mail & Spam Presentation
E Mail & Spam PresentationE Mail & Spam Presentation
E Mail & Spam Presentation
 
AN EMPIRICAL ANALYSIS OF EMAIL FORENSICS TOOLS
AN EMPIRICAL ANALYSIS OF EMAIL FORENSICS TOOLSAN EMPIRICAL ANALYSIS OF EMAIL FORENSICS TOOLS
AN EMPIRICAL ANALYSIS OF EMAIL FORENSICS TOOLS
 
Machine Learning Project - Email Spam Filtering using Enron Dataset
Machine Learning Project - Email Spam Filtering using Enron DatasetMachine Learning Project - Email Spam Filtering using Enron Dataset
Machine Learning Project - Email Spam Filtering using Enron Dataset
 
Naming Disambiguation in Authors Database
Naming Disambiguation in Authors DatabaseNaming Disambiguation in Authors Database
Naming Disambiguation in Authors Database
 
Tutorial 06 - Real-Time Communication on the Internet
Tutorial 06 - Real-Time Communication on the InternetTutorial 06 - Real-Time Communication on the Internet
Tutorial 06 - Real-Time Communication on the Internet
 
Email spam detection
Email spam detectionEmail spam detection
Email spam detection
 
Web 2.0: Making Email a Useful Web App
Web 2.0: Making Email a Useful Web AppWeb 2.0: Making Email a Useful Web App
Web 2.0: Making Email a Useful Web App
 
What is a spam ?
What is a spam ?What is a spam ?
What is a spam ?
 
Electronic Mail Security (University of Jeddah, Saudi Arabia)
Electronic Mail Security (University of Jeddah, Saudi Arabia)Electronic Mail Security (University of Jeddah, Saudi Arabia)
Electronic Mail Security (University of Jeddah, Saudi Arabia)
 
Spam Filtering
Spam FilteringSpam Filtering
Spam Filtering
 
E-Mail Security Using Spam Mail Detection and Filtering System
E-Mail Security Using Spam Mail Detection and Filtering SystemE-Mail Security Using Spam Mail Detection and Filtering System
E-Mail Security Using Spam Mail Detection and Filtering System
 
A multi layer architecture for spam-detection system
A multi layer architecture for spam-detection systemA multi layer architecture for spam-detection system
A multi layer architecture for spam-detection system
 
B0940509
B0940509B0940509
B0940509
 
Can domain intelligence help healthcare service providers combat data breaches
Can domain intelligence help healthcare service providers combat data breachesCan domain intelligence help healthcare service providers combat data breaches
Can domain intelligence help healthcare service providers combat data breaches
 
A novel hybrid approach of SVM combined with NLP and probabilistic neural ne...
A novel hybrid approach of SVM combined with  NLP and probabilistic neural ne...A novel hybrid approach of SVM combined with  NLP and probabilistic neural ne...
A novel hybrid approach of SVM combined with NLP and probabilistic neural ne...
 
Overview of Anti-spam filtering Techniques
Overview of Anti-spam filtering TechniquesOverview of Anti-spam filtering Techniques
Overview of Anti-spam filtering Techniques
 
Spam Detection in Social Networks Using Correlation Based Feature Subset Sele...
Spam Detection in Social Networks Using Correlation Based Feature Subset Sele...Spam Detection in Social Networks Using Correlation Based Feature Subset Sele...
Spam Detection in Social Networks Using Correlation Based Feature Subset Sele...
 

Andere mochten auch

Website for more followers on keek
Website for more followers on keekWebsite for more followers on keek
Website for more followers on keekmandy365
 
Webviewer keek
Webviewer keekWebviewer keek
Webviewer keekmandy365
 
STABILITY ANALYSIS AND CONTROL OF A 3-D AUTONOMOUS AI-YUAN-ZHI-HAO HYPERCHAOT...
STABILITY ANALYSIS AND CONTROL OF A 3-D AUTONOMOUS AI-YUAN-ZHI-HAO HYPERCHAOT...STABILITY ANALYSIS AND CONTROL OF A 3-D AUTONOMOUS AI-YUAN-ZHI-HAO HYPERCHAOT...
STABILITY ANALYSIS AND CONTROL OF A 3-D AUTONOMOUS AI-YUAN-ZHI-HAO HYPERCHAOT...ijscai
 
Arista Villas- Phong cách Home resort Duy nhất tại HCMHotline: 0985 889 990 w...
Arista Villas- Phong cách Home resort Duy nhất tại HCMHotline: 0985 889 990 w...Arista Villas- Phong cách Home resort Duy nhất tại HCMHotline: 0985 889 990 w...
Arista Villas- Phong cách Home resort Duy nhất tại HCMHotline: 0985 889 990 w...TTC Land
 
Website to get keek followers free
Website to get keek followers freeWebsite to get keek followers free
Website to get keek followers freemandy365
 
Analysis of key considerations of the public when choosing recreational activ...
Analysis of key considerations of the public when choosing recreational activ...Analysis of key considerations of the public when choosing recreational activ...
Analysis of key considerations of the public when choosing recreational activ...ijcsit
 
Design and implementation of a personal super Computer
Design and implementation of a personal super ComputerDesign and implementation of a personal super Computer
Design and implementation of a personal super Computerijcsit
 
Šťastný marketing (Vladimír Přichystal)
Šťastný marketing (Vladimír Přichystal)Šťastný marketing (Vladimír Přichystal)
Šťastný marketing (Vladimír Přichystal)BESTETO
 
Webviewr keek
Webviewr keekWebviewr keek
Webviewr keekmandy365
 
A linear algorithm for the grundy number of a tree
A linear algorithm for the grundy number of a treeA linear algorithm for the grundy number of a tree
A linear algorithm for the grundy number of a treeijcsit
 
Performance Variations in Profiling Mysql Server on the Xen Platform: Is It X...
Performance Variations in Profiling Mysql Server on the Xen Platform: Is It X...Performance Variations in Profiling Mysql Server on the Xen Platform: Is It X...
Performance Variations in Profiling Mysql Server on the Xen Platform: Is It X...ijcsit
 
Website get more followers keek free
Website get more followers keek freeWebsite get more followers keek free
Website get more followers keek freemandy365
 
Websites to get more keek followers for free
Websites to get more keek followers for freeWebsites to get more keek followers for free
Websites to get more keek followers for freemandy365
 
MULTI-PARAMETER BASED PERFORMANCE EVALUATION OF CLASSIFICATION ALGORITHMS
MULTI-PARAMETER BASED PERFORMANCE EVALUATION OF CLASSIFICATION ALGORITHMSMULTI-PARAMETER BASED PERFORMANCE EVALUATION OF CLASSIFICATION ALGORITHMS
MULTI-PARAMETER BASED PERFORMANCE EVALUATION OF CLASSIFICATION ALGORITHMSijcsit
 

Andere mochten auch (15)

Website for more followers on keek
Website for more followers on keekWebsite for more followers on keek
Website for more followers on keek
 
Webviewer keek
Webviewer keekWebviewer keek
Webviewer keek
 
STABILITY ANALYSIS AND CONTROL OF A 3-D AUTONOMOUS AI-YUAN-ZHI-HAO HYPERCHAOT...
STABILITY ANALYSIS AND CONTROL OF A 3-D AUTONOMOUS AI-YUAN-ZHI-HAO HYPERCHAOT...STABILITY ANALYSIS AND CONTROL OF A 3-D AUTONOMOUS AI-YUAN-ZHI-HAO HYPERCHAOT...
STABILITY ANALYSIS AND CONTROL OF A 3-D AUTONOMOUS AI-YUAN-ZHI-HAO HYPERCHAOT...
 
Arista Villas- Phong cách Home resort Duy nhất tại HCMHotline: 0985 889 990 w...
Arista Villas- Phong cách Home resort Duy nhất tại HCMHotline: 0985 889 990 w...Arista Villas- Phong cách Home resort Duy nhất tại HCMHotline: 0985 889 990 w...
Arista Villas- Phong cách Home resort Duy nhất tại HCMHotline: 0985 889 990 w...
 
Website to get keek followers free
Website to get keek followers freeWebsite to get keek followers free
Website to get keek followers free
 
Analysis of key considerations of the public when choosing recreational activ...
Analysis of key considerations of the public when choosing recreational activ...Analysis of key considerations of the public when choosing recreational activ...
Analysis of key considerations of the public when choosing recreational activ...
 
Design and implementation of a personal super Computer
Design and implementation of a personal super ComputerDesign and implementation of a personal super Computer
Design and implementation of a personal super Computer
 
Šťastný marketing (Vladimír Přichystal)
Šťastný marketing (Vladimír Přichystal)Šťastný marketing (Vladimír Přichystal)
Šťastný marketing (Vladimír Přichystal)
 
Webviewr keek
Webviewr keekWebviewr keek
Webviewr keek
 
A linear algorithm for the grundy number of a tree
A linear algorithm for the grundy number of a treeA linear algorithm for the grundy number of a tree
A linear algorithm for the grundy number of a tree
 
Performance Variations in Profiling Mysql Server on the Xen Platform: Is It X...
Performance Variations in Profiling Mysql Server on the Xen Platform: Is It X...Performance Variations in Profiling Mysql Server on the Xen Platform: Is It X...
Performance Variations in Profiling Mysql Server on the Xen Platform: Is It X...
 
Website get more followers keek free
Website get more followers keek freeWebsite get more followers keek free
Website get more followers keek free
 
Websites to get more keek followers for free
Websites to get more keek followers for freeWebsites to get more keek followers for free
Websites to get more keek followers for free
 
MULTI-PARAMETER BASED PERFORMANCE EVALUATION OF CLASSIFICATION ALGORITHMS
MULTI-PARAMETER BASED PERFORMANCE EVALUATION OF CLASSIFICATION ALGORITHMSMULTI-PARAMETER BASED PERFORMANCE EVALUATION OF CLASSIFICATION ALGORITHMS
MULTI-PARAMETER BASED PERFORMANCE EVALUATION OF CLASSIFICATION ALGORITHMS
 
IT Governance
IT GovernanceIT Governance
IT Governance
 

Ähnlich wie Processing obtained email data by using naïve bayes learning algorithm

Email Marketing techniques
Email Marketing techniquesEmail Marketing techniques
Email Marketing techniquesJeroen Vos
 
Study of Various Techniques to Filter Spam Emails
Study of Various Techniques to Filter Spam EmailsStudy of Various Techniques to Filter Spam Emails
Study of Various Techniques to Filter Spam EmailsIRJET Journal
 
A LITERATURE REVIEW ON PHISHING EMAIL DETECTION USING DATA MINING
A LITERATURE REVIEW ON PHISHING EMAIL DETECTION USING DATA MININGA LITERATURE REVIEW ON PHISHING EMAIL DETECTION USING DATA MINING
A LITERATURE REVIEW ON PHISHING EMAIL DETECTION USING DATA MININGHeather Strinden
 
NetworkPaperthesis1
NetworkPaperthesis1NetworkPaperthesis1
NetworkPaperthesis1Dhara Shah
 
Running header EMAIL FORENSICSEMAIL FORENSICSEmail Forens.docx
Running header EMAIL FORENSICSEMAIL FORENSICSEmail Forens.docxRunning header EMAIL FORENSICSEMAIL FORENSICSEmail Forens.docx
Running header EMAIL FORENSICSEMAIL FORENSICSEmail Forens.docxjeffsrosalyn
 
Identification of Spam Emails from Valid Emails by Using Voting
Identification of Spam Emails from Valid Emails by Using VotingIdentification of Spam Emails from Valid Emails by Using Voting
Identification of Spam Emails from Valid Emails by Using VotingEditor IJCATR
 
Is Email Marketing Dead?
Is Email Marketing Dead?Is Email Marketing Dead?
Is Email Marketing Dead?SMTP Depot
 
MINIMIZING THE TIME OF SPAM MAIL DETECTION BY RELOCATING FILTERING SYSTEM TO ...
MINIMIZING THE TIME OF SPAM MAIL DETECTION BY RELOCATING FILTERING SYSTEM TO ...MINIMIZING THE TIME OF SPAM MAIL DETECTION BY RELOCATING FILTERING SYSTEM TO ...
MINIMIZING THE TIME OF SPAM MAIL DETECTION BY RELOCATING FILTERING SYSTEM TO ...IJNSA Journal
 
Presentation on Email phishing.pptx
Presentation on Email phishing.pptxPresentation on Email phishing.pptx
Presentation on Email phishing.pptxAbdulHaseebKhan34
 
Honeypot Projects are Everywhere
Honeypot Projects are EverywhereHoneypot Projects are Everywhere
Honeypot Projects are EverywhereChristos Beretas
 
Eloqua Grande Guide to Deliverability and Privacy
Eloqua Grande Guide to Deliverability and PrivacyEloqua Grande Guide to Deliverability and Privacy
Eloqua Grande Guide to Deliverability and PrivacyJoe Chernov
 
Seminar on web mail filter
Seminar   on   web mail filterSeminar   on   web mail filter
Seminar on web mail filterShalini Gs
 
Cyber security and emails presentation refined
Cyber security and emails presentation refinedCyber security and emails presentation refined
Cyber security and emails presentation refinedWan Solo
 
Deliverability webinar ppt show
Deliverability webinar ppt showDeliverability webinar ppt show
Deliverability webinar ppt showInformz
 
VOICE BASED EMAIL SYSTEM
VOICE BASED EMAIL SYSTEMVOICE BASED EMAIL SYSTEM
VOICE BASED EMAIL SYSTEMIRJET Journal
 

Ähnlich wie Processing obtained email data by using naïve bayes learning algorithm (20)

Email Marketing techniques
Email Marketing techniquesEmail Marketing techniques
Email Marketing techniques
 
Study of Various Techniques to Filter Spam Emails
Study of Various Techniques to Filter Spam EmailsStudy of Various Techniques to Filter Spam Emails
Study of Various Techniques to Filter Spam Emails
 
A LITERATURE REVIEW ON PHISHING EMAIL DETECTION USING DATA MINING
A LITERATURE REVIEW ON PHISHING EMAIL DETECTION USING DATA MININGA LITERATURE REVIEW ON PHISHING EMAIL DETECTION USING DATA MINING
A LITERATURE REVIEW ON PHISHING EMAIL DETECTION USING DATA MINING
 
NetworkPaperthesis1
NetworkPaperthesis1NetworkPaperthesis1
NetworkPaperthesis1
 
Running header EMAIL FORENSICSEMAIL FORENSICSEmail Forens.docx
Running header EMAIL FORENSICSEMAIL FORENSICSEmail Forens.docxRunning header EMAIL FORENSICSEMAIL FORENSICSEmail Forens.docx
Running header EMAIL FORENSICSEMAIL FORENSICSEmail Forens.docx
 
Identification of Spam Emails from Valid Emails by Using Voting
Identification of Spam Emails from Valid Emails by Using VotingIdentification of Spam Emails from Valid Emails by Using Voting
Identification of Spam Emails from Valid Emails by Using Voting
 
Is Email Marketing Dead?
Is Email Marketing Dead?Is Email Marketing Dead?
Is Email Marketing Dead?
 
MINIMIZING THE TIME OF SPAM MAIL DETECTION BY RELOCATING FILTERING SYSTEM TO ...
MINIMIZING THE TIME OF SPAM MAIL DETECTION BY RELOCATING FILTERING SYSTEM TO ...MINIMIZING THE TIME OF SPAM MAIL DETECTION BY RELOCATING FILTERING SYSTEM TO ...
MINIMIZING THE TIME OF SPAM MAIL DETECTION BY RELOCATING FILTERING SYSTEM TO ...
 
Presentation on Email phishing.pptx
Presentation on Email phishing.pptxPresentation on Email phishing.pptx
Presentation on Email phishing.pptx
 
Honeypot Projects are Everywhere
Honeypot Projects are EverywhereHoneypot Projects are Everywhere
Honeypot Projects are Everywhere
 
Research Report
Research ReportResearch Report
Research Report
 
Eloqua Grande Guide to Deliverability and Privacy
Eloqua Grande Guide to Deliverability and PrivacyEloqua Grande Guide to Deliverability and Privacy
Eloqua Grande Guide to Deliverability and Privacy
 
Seminar on web mail filter
Seminar   on   web mail filterSeminar   on   web mail filter
Seminar on web mail filter
 
spam
spamspam
spam
 
Cyber security and emails presentation refined
Cyber security and emails presentation refinedCyber security and emails presentation refined
Cyber security and emails presentation refined
 
SPAM FILTERS
SPAM FILTERSSPAM FILTERS
SPAM FILTERS
 
Deliverability webinar ppt show
Deliverability webinar ppt showDeliverability webinar ppt show
Deliverability webinar ppt show
 
Email Deliverability Guide For 2013
Email Deliverability Guide For 2013Email Deliverability Guide For 2013
Email Deliverability Guide For 2013
 
VOICE BASED EMAIL SYSTEM
VOICE BASED EMAIL SYSTEMVOICE BASED EMAIL SYSTEM
VOICE BASED EMAIL SYSTEM
 
M dgx mde0mde=
M dgx mde0mde=M dgx mde0mde=
M dgx mde0mde=
 

Kürzlich hochgeladen

Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...AliaaTarek5
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterMydbops
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxBkGupta21
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 

Kürzlich hochgeladen (20)

Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL Router
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptx
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 

Processing obtained email data by using naïve bayes learning algorithm

  • 1. International Journal of Computer Science & Information Technology (IJCSIT) Vol 6, No 1, February 2014 DOI : 10.5121/ijcsit.2014.6111 147 Processing Obtained Email Data By Using Naïve Bayes Learning Algorithm Valery V. Pelenko, Aleksandr V. Baranenko St. Petersburg Research University of Information Technologies, Mechanics and Optics. Institute of Refrigeration and Biotechnology 9 Lomonosova St., St Petersburg, 191002, Russia ABSTRACT This paper gives a basic idea how various machine learning techniques may be applied towards processing the data from DEA services to find out whether people use these services for legitimate or non-legitimate purposes. KEYWORDS Spam, ham, emails, communications, machine learning, Naïve Bayes, DEA services. 1. INTRODUCTION In this paper we are characterizing a set of found email data using a method of supervised learning – Naïve Bayes. There are lots of DEA services, but we decided to get the data from most commonly used DEA services [1]. The data set was chosen from five independent disposable email address (DEA) services: dispostable, mailinator, mytrashmail, staging and tempemail. The DEA allow users to receive the emails without creating an account. Not much is known how people use these services; in fact, there are so many legitimate and non-legitimate purposes that people use these services for. Since the accounts don’t require the password to access them, this data becomes public. Some users may not realize how public this data are and that anyone else can purposely or accidentally access this private information. We decided to go ahead and categorize the found email data and split it up in four main categories. Our goal is to find out how much of the information that is stored in the DEA services is legitimate. At the same time, the main purpose for this paper was to break down the data usage and present the statistics of the DEA services. We found that about 72% of the data was tagged ‘spam’ comparing to only 6.7% of legitimate data which was tagged ‘ham’. The remaining part was split unevenly between 6.2% of ‘other’ and 15.1% of ‘Non English’. We used Naïve Bayes to split the data into the categories mentioned above. During thus project, we were dealing with the data that was obtained from five various independent DEA services. DEA services are such services that do not require a user to create an account in order to receive an email to any email address supported by the service. Thus, all the data becomes public due to no identification required. Since the DEA services are used by many users from all over the world for a variety of purposes, then the data that was obtained such services is obviously to be very diverse. It is very important to understand that users may use DEA services for both legitimate and non-legitimate purposes [2]. Since all the data is public and
  • 2. International Journal of Computer Science & Information Technology (IJCSIT) Vol 6, No 1, February 2014 148 some users use the DEA services for legitimate purposes, these users may not realize how public the data are and what risk they take. This is the main factor why DEA services may not be very helpful as they seem to be at first. In order to protect personal information and avoid an identity theft, the users should be informed beforehand what risk they take. Some non-legitimate users may find this personal data such as: name, address, SSN, credit card number, and others very useful and the victim can suffer adverse consequences if they are held accountable for the perpetrator's actions. The main purpose of the paper was to characterize this found email data. Once we are done splitting the data into different categorize, we can look at the informative data. The informative data would be categorized as ‘ham’. A closer look at the messages in the ‘ham’ category will benefit us in some ways such are presented below: • Understand what people use DEA services for legitimate purposes. • What did force these users use DEA service over the standard email account? • Is there any risk that people take when using DEA services for legitimate purposes, if there is, how high is that risk? After processing all the data, We will be able to demonstrate how much of this data are used for legitimate purposes, non-legitimate purposes and any other purposes (if any); show whether the users take any risk of loss of any personal information when using DEA services 2. FOUND EMAIL DATA SETS Disposable email addressing (DEA) is a way of sharing and managing email addressing [3]. DEA allows user to set up a new, unique email address for every contact or use an already existing email address that only requires having a username to access the existing account in order to make a connection between the sender and the recipient. Our main idea is that the DEA services are mostly used by people who try to avoid using their personal email accounts due to various factors including receiving spam [3,4]. Thus, these people prefer to use DEA services to stay anonymous. The main advantage of using the DEA services is that there is no need to register by using your real credentials and, of course, these services are free of charge. The user is given a choice to select any name for the email address and use it, even if it was used before [5]. As any other service, there are also disadvantages of using the DEA services. If this account has been used before, then the user will be able to track all the messages in this account and some private information may become public. Many forums and legitimate services filter out messages sent from DEA domains. 3. METHODOLOGY The primary work for this paper was to characterize the found email data from various email services listed below: • Dispostable • Mailinator • Mytrashmail • Staging • Tempemail
  • 3. International Journal of Computer Science & Information Technology (IJCSIT) Vol 6, No 1, February 2014 149 We have competed two separate stages of characterizing the data. We were mainly interested in breaking down the category of “ham” email versus “spam” category. The categories used at the first and second stages are listed below: Stage 1: • Spam The messages considered to be under the category of spam if it’s obvious that the recipient has no pre-existing relations with the sender. For example, any message that looks like an advertisement and it was sent to thousands of people simultaneously, this message will be considered as spam [4]. • Ham The messages considered to be under the category of ham if the user seems to have pre- existing relation such as the message seems to indicate a specific action taken by the person to cause it (an invoice for purchased products, or a response mentioning that a purchase could not go through etc.) will be considered ham [4]. • Non English The messages considered to be non English if the original language in what the message was written is other than English, e.g., Russian, Spanish, Italian, Chinese, Arabic, Hebrew, Ukrainian (the listed languages were found in the found email data set). • Other The messages considered to be under the category of other if the messages couldn’t be assigned to any other category mentioned above. For example, an email with no text in it and a plain black background would be described as other. • Errors The messages considered to be under the category of other if the messages couldn’t be parsed by program or broken files. Stage 2: • Buying The messages considered to be under the category of buying if the email that was received by the recipient mentions that he or she has purchased something. For example, an email received from paypal service would be considered as buying email. • Signing Up The messages considered to be under the category of signing up if the email that was received by the percipient shows that the user has subscribed to some service. For example, an email received from the forum web sites is considered as signing up email.
  • 4. International Journal of Computer Science & Information Technology (IJCSIT) Vol 6, No 1, February 2014 150 • Non Ham (Spam) The definition provided above in stage 1. • Personal The messages considered to be under the category of personal if the user has requested a subscription from the social websites. For example, an email received from Linkedin is considered as a personal email. The found email data, as mention above, was taken from five independent DEA services. Please refer to the table to see what the size of the data was from each service. DEA Service Size of the data (GB) % of data Dispostable 2.1 24.21% Mailinator 0.009 0.010% Mytrashmail 0.22 0.253% Staging 0.095 1.095% Tempemail 6.25 74.432% Overall 8.674 100% There are two main sets of machine learning techniques that can be used for classifying the data: supervised and unsupervised learning [6]. The supervised learning is when the data is tagged before the algorithm makes any further decisions. On the other hand, if there is no input to the algorithm, then unsupervised learning has to be used. Algorithms in the group of unsupervised learning find the similarities and/or correlations in the data and require no input to classify the data. There was a choice of using either supervised or unsupervised learning. Since the data set of found email data was 8.67 GB, WE decided to use supervised learning. Naïve Bayes method was chosen to be used as a method for supervised learning [6]. For the purposes of the describing the data, we decided to use the Naïve Bayes algorithm which is suitable for handling big sets of data, in our example the data set is 8.764 GB and diverse data. A Naïve Bayes classifier is a simple probabilistic classifier based on applying Bayes' theorem with naive independence assumptions. In simple terms, a naive Bayes classifier assumes that the presence (or absence) of a particular feature of a class is unrelated to the presence (or absence) of any other feature. For example, a fruit may be considered to be an apple if it is red, round, and about 4" in diameter. Even if these features depend on each other or upon the existence of the other features, a naive Bayes classifier considers all of these properties to independently contribute to the probability that this fruit is an apple. Depending on the precise nature of the probability model, Naïve Bayes classifiers can be trained very efficiently in a supervised learning setting [6]. In many practical applications, parameter estimation for Naïve Bayes models uses the method of maximum likelihood; in other words, one can work with the Naïve Bayes model without believing in Bayesian probability or using any Bayesian methods [7].
  • 5. International Journal of Computer Science & Information Technology (IJCSIT) Vol 6, No 1, February 2014 Abstractly, the probability model for a classifier is a conditional model: dependent class variable C with a small number of outcomes or classes, conditional on several feature variables 1F through nF . The problem is that if the number of features a feature can take on a large number of values, then basing such a model on probability tables is infeasible [6]. We therefore reformulate the model to make it more tractable. Using Bayes' theorem, this can be written as following: ,...,( ,...,()( ),...,|( 1 1 1 n Fp FpCp FFCP = For the purpose of understanding, in simple English the equation provided above can be written as following: evidence likelihoodprior posterior * = 4. RESULTS Due to two factors such as diversity of the data and the size of the data, the algorithm was run twice in order to split the data. At the first stage the following tags were introduced into the model. The description of each tag is provided in the introduction section. The size of data considered at the first stage was 8.76 GB. 1. Spam 2. Ham 3. Non English 4. Other 5. Errors After training the algorithm, the size of the database which contains the words and its probabilities was 1878 KB (234 messages). The results obtained after the first stage are as following: 83726 0 50000 100000 150000 200000 250000 300000 350000 400000 Non English International Journal of Computer Science & Information Technology (IJCSIT) Vol 6, No 1, February 2014 Abstractly, the probability model for a classifier is a conditional model: ,...,|( 1FCP with a small number of outcomes or classes, conditional on several . The problem is that if the number of features n is large or when mber of values, then basing such a model on probability tables is therefore reformulate the model to make it more tractable. Using Bayes' theorem, this can be written as following: ),..., )|,..., n n F CF understanding, in simple English the equation provided above can be written likelihood Due to two factors such as diversity of the data and the size of the data, the algorithm was run data. At the first stage the following tags were introduced into the model. The description of each tag is provided in the introduction section. The size of data considered at the first stage was 8.76 GB. the algorithm, the size of the database which contains the words and its probabilities was 1878 KB (234 messages). The results obtained after the first stage are as 77090 33559 347191 11418 Ham Other Spam Errors # of messages Stage 1 - Results International Journal of Computer Science & Information Technology (IJCSIT) Vol 6, No 1, February 2014 151 ),..., nF over a with a small number of outcomes or classes, conditional on several is large or when mber of values, then basing such a model on probability tables is understanding, in simple English the equation provided above can be written Due to two factors such as diversity of the data and the size of the data, the algorithm was run data. At the first stage the following tags were introduced into the model. The description of each tag is provided in the introduction section. The size of data the algorithm, the size of the database which contains the words and its probabilities was 1878 KB (234 messages). The results obtained after the first stage are as
  • 6. International Journal of Computer Science & Information Technology (IJCSIT) Vol 6, No 1, February 2014 Since the main idea was to break down the category of ham email, the algorithm at the second stage filtered out the emails that were tagged as ‘Ham’ after completing the first stage. In order to improve the accuracy of the algorithm, at the second stage. The list of the tags is introduced below (the description of the tags in provided in the introduction section). The size of data considered at the second stage was 1.24 GB. 1. Buying 2. Signing Up 3. Non Ham (Spam) 4. Personal After training the algorithm, the size of the database which contains the words and its probabilities was 1620 KB (202 messages). The results obtained after the first stage are as following: For the purposes of easier understanding, category break-down: 0 10000 Buying Signing Up Non Ham (Spam) Personal 68.31% 7.12% International Journal of Computer Science & Information Technology (IJCSIT) Vol 6, No 1, February 2014 Since the main idea was to break down the category of ham email, the algorithm at the second stage filtered out the emails that were tagged as ‘Ham’ after completing the first stage. In order to improve the accuracy of the algorithm, we had to include the ‘Spam’ tag into the list of tags used at the second stage. The list of the tags is introduced below (the description of the tags in provided in the introduction section). The size of data considered at the second stage was 1.24 After training the algorithm, the size of the database which contains the words and its probabilities was 1620 KB (202 messages). The results obtained after the first stage are For the purposes of easier understanding, the graph below shows percentage values of the 10000 20000 30000 40000 50000 Stage 2 - Results # of messages 5.36% 19.21% 7.12% Buying Signing Up Non Ham (Spam) Personal International Journal of Computer Science & Information Technology (IJCSIT) Vol 6, No 1, February 2014 152 Since the main idea was to break down the category of ham email, the algorithm at the second stage filtered out the emails that were tagged as ‘Ham’ after completing the first stage. In order to ‘Spam’ tag into the list of tags used at the second stage. The list of the tags is introduced below (the description of the tags in provided in the introduction section). The size of data considered at the second stage was 1.24 After training the algorithm, the size of the database which contains the words and its probabilities was 1620 KB (202 messages). The results obtained after the first stage are the graph below shows percentage values of the Non Ham (Spam)
  • 7. International Journal of Computer Science & Information Technology (IJCSIT) Vol 6, No 1, February 2014 153 Due to the big amount of data, the algorithm accuracy was computer using 50 messages as following: %100* 50 # sagesMatchedMes Accuracy = ; the message is considered to be matched if the tag assigned by human is the same as assigned by the Naïve Bayes Classifier. %82%100* 50 41 ==Accuracy 5. CONCLUSION After completing all the stages of the algorithm it’s obvious that mostly the DEA services are used for ‘spam’, which is more than 70%. The minority of usage belongs to the ‘ham’ category; it’s only 6.7% of all the data. The other small usage goes under the category ‘other’; it’s only 6.2%. It’s important to note that a little less 1/5 of the data is used by foreigners (i.e. in language other than English). The detailed breakdown is provided below: Category # of messages % of messages Non English 83726 15.13% Ham 36746 6.66% Other 33559 6.07% Spam 387535 70.1% Errors 11418 2.04% The main purpose of the paper was to classify and breakdown the ‘ham’ category of emails, but it couldn’t be done without completing two stages of the algorithm and split out the ‘spam’ data at the second stage. Mostly the ‘ham’ emails are used for signing up for different services and the percentage of this data is over 60% (4.11% of the whole data set). The second largest subcategory which is being used in the ‘ham’ category is ‘personal’; its percentage is 22.5% (1.53% of the whole data set). The least used subcategory of the ‘ham’ category belongs to ‘buying’; its percentage is roughly around 17% (1.14% of the whole data set). Please refer to the table below for the breakdown of the ‘ham’ category. Category # of messages % of messages Buying 3163 16.9% Signing Up 11345 60.62% Personal 4207 22.48% Please refer to the graphical illustration of the ham category below:
  • 8. International Journal of Computer Science & Information Technology (IJCSIT) Vol 6, No 1, February 2014 The future work mainly consists of considering the messages that are considered to be in the ‘ham’ category. A closer look at those messages will provide an image of what are the main purposes and premises of preferring using the DEA services over using st for legitimate purposes. Answering the question “Can human readable filter be found among this data to look at” will be answered by manual processing of the split data. REFERENCES [1] Smirnov Arthur Describing how the data obtained Economics and Environmental Management [2] Seigneur, J-M., and Christian Damsgaard Jensen. "Privacy recovery with disposable email addresses." Security & Privacy, IEEE 1.6 (2003): 35 [3] Reed, Micheal G., Paul F. Syverson, and David M. Goldschlag. "Anonymous connections and onion routing." Selected Areas in Communications, IEEE Journal on 16.4 (1998): 482 [4] Bradley, David (2009-05-13). "Spam or Ham?" Sciencetext. 2011 [5] Shields, Clay, and Brian Neil Levine. "A protocol for anonymous communication over the Internet." Proceedings of the 7th ACM conference on Computer and communications security. ACM, 2000. [6] Smirnov, Arthur. “Artificial Intelligence: Concepts and Applicable Uses.” Lam Publishing (2013). [7] Schneider, Karl-Michael. "A comparison of event models for Naive Bayes anti Proceedings of the tenth conference on European chapter of the Association for Computational Linguistics-Volume 1. Association for Computational Linguistics, 2003. Ham Category Break International Journal of Computer Science & Information Technology (IJCSIT) Vol 6, No 1, February 2014 The future work mainly consists of considering the messages that are considered to be in the ‘ham’ category. A closer look at those messages will provide an image of what are the main purposes and premises of preferring using the DEA services over using standard email accounts for legitimate purposes. Answering the question “Can human readable filter be found among this data to look at” will be answered by manual processing of the split data. Smirnov Arthur Describing how the data obtained from DEA services is being used nowadays// Economics and Environmental Management. – 2014 M., and Christian Damsgaard Jensen. "Privacy recovery with disposable email addresses." Security & Privacy, IEEE 1.6 (2003): 35-39 heal G., Paul F. Syverson, and David M. Goldschlag. "Anonymous connections and onion routing." Selected Areas in Communications, IEEE Journal on 16.4 (1998): 482-494 13). "Spam or Ham?" Sciencetext. 2011-09-28 , and Brian Neil Levine. "A protocol for anonymous communication over the Internet." Proceedings of the 7th ACM conference on Computer and communications security. ACM, 2000. Smirnov, Arthur. “Artificial Intelligence: Concepts and Applicable Uses.” Lambert Academic Michael. "A comparison of event models for Naive Bayes anti-spam e- Proceedings of the tenth conference on European chapter of the Association for Computational ciation for Computational Linguistics, 2003. Ham Category Break-down Buying Signing Up Personal International Journal of Computer Science & Information Technology (IJCSIT) Vol 6, No 1, February 2014 154 The future work mainly consists of considering the messages that are considered to be in the ‘ham’ category. A closer look at those messages will provide an image of what are the main andard email accounts for legitimate purposes. Answering the question “Can human readable filter be found among this from DEA services is being used nowadays// M., and Christian Damsgaard Jensen. "Privacy recovery with disposable email addresses." heal G., Paul F. Syverson, and David M. Goldschlag. "Anonymous connections and onion , and Brian Neil Levine. "A protocol for anonymous communication over the Internet." Proceedings of the 7th ACM conference on Computer and communications security. ACM, 2000. bert Academic -mail filtering." Proceedings of the tenth conference on European chapter of the Association for Computational Signing Up