A comparative analysis of data mining tools for performance mapping of wlan data
INTERNATIONAL JOURNAL OF COMPUTER ENGINEERING & TECHNOLOGY (IJCET)
ISSN 0976 – 6367 (Print)
ISSN 0976 – 6375 (Online)
Volume 4, Issue 2, March – April (2013), pp. 241-251
© IAEME: www.iaeme.com/ijcet.asp
Journal Impact Factor (2013): 6.1302 (Calculated by GISI)
www.jifactor.com
A COMPARATIVE ANALYSIS OF DATA MINING TOOLS FOR
PERFORMANCE MAPPING OF WLAN DATA
Mr. Ajay M. Patel
Assistant Professor,
Acharya Motibhai Patel Institute of Computer Studies, Ganpat University,
Ganpat Vidyanagar-384012, India
Dr. A. R. Patel
Director, Department of Computer Application & Information Technology,
H. North Gujarat University,
Patan - 384265, India
Ms. Hiral R. Patel
Assistant Professor, Department of Computer Science,
Ganpat University,
Ganpat Vidyanagar-384012, India
ABSTRACT
Data mining is the non-trivial process of identifying valid, potentially useful and
understandable patterns in large volumes of data, as a form of knowledge discovery.
The main aim of this process is to discover patterns and associations among preprocessed
and transformed data. Data mining is used for two types of analysis: prediction and
description. Prediction estimates unknown or future values of selected variables.
Description finds human-interpretable patterns. The major application areas include
business and finance, the stock market, telecommunications, health care, surveillance,
fraud detection, scientific discovery and, nowadays, networking. Data mining supports
both supervised and unsupervised machine learning. This paper uses the unsupervised
learning process. The data set is a wireless network log with 13 attributes and 1000
instances, used for anomaly detection. The research focuses on the performance mapping
of the unsupervised algorithms supported by different data mining tools. Each tool
provides different clustering algorithms with different performance measures. The same
data set is applied to every tool. This paper presents a comparative analysis of the
performance of these algorithms across the tools.
Index Terms: Accuracy, Anomaly Detection, Clustering, Data Mining, Error Rate,
Unsupervised Learning.
1. INTRODUCTION
Data mining is a machine learning process for detecting unknown patterns in data, and it
provides many useful analytical techniques. This research shows the use of data mining
techniques for anomaly detection in wireless networking. The most obvious advantage of
wireless networking is mobility. Wireless network users can connect to existing networks
and are then allowed to roam freely. In next generation wireless networks, one of the most
serious challenges is how to maintain a continuous connection while a mobile user moves
among cells, which is made possible by the handover procedure. An intrusion prevention
system (IPS) is software that has all the capabilities of an intrusion detection system
(IDS) and can also attempt to stop possible incidents. An IPS combines an IDS with a
firewall, a virus detection algorithm, a vulnerability assessment algorithm, etc. The aim
of such a system is to manage both preventive and responsive actions against attacks on a
computer network. [10] The wireless log history hides useful knowledge patterns that
describe the typical behavior of anomalies in packet transmission. [5] In network security
research, intrusion detection is a critical concern. Misuse detection and anomaly detection
are the two basic approaches to intrusion detection. An intrusion detection system collects
and examines data to detect intrusions and misuse in a computer system or network. [7]
Data mining provides various techniques for finding these types of anomalous intrusion
activities.
1.1 Data Mining
Data mining is a machine learning technique that provides different methods for finding
knowledge and unknown patterns in raw data. Data mining is emerging as a key feature of
many security initiatives. Both the private and public sectors are increasingly using data
mining. Many application domains such as banking, insurance, medicine, and retailing
frequently use data mining to reduce costs, enhance research, and increase sales. Data
mining applications initially were used as a means to detect fraud and waste, but have
grown to also be used for purposes such as measuring and improving program performance.
Data mining involves the use of sophisticated data analysis tools to discover previously
unknown, valid patterns and relationships in large data sets. These tools can include
statistical models, mathematical algorithms, and machine learning methods, that is,
algorithms that improve their performance automatically through experience, such as neural
networks or decision trees. Data mining exploits a discovery approach, in which algorithms
can be used to examine several multidimensional data relationships concurrently,
identifying those that are unique or frequently represented. Many organizations provide
data mining tools that analyze user-supplied, work-oriented information and give analytical
results to interpret, so these tools reduce fraud and wasted time and assist in developing
algorithms for research. It is still possible, and often preferable, to use or modify the
algorithms as required. Recently, data mining has been increasingly cited as an important
tool for various security efforts. Some observers suggest that data mining should be used
as a means to identify terrorist or intrusive activities, such as money transfers and
electronic communications, and to identify and track individual terrorists or intruders
themselves, such as through travel and immigration records. [9]
1.2 Why Unsupervised Learning?
Data mining is the process of extracting knowledge from a database. Data mining models can
be categorized according to the tasks they perform. Data mining techniques are either
predictive (supervised) or descriptive (unsupervised). Of the main techniques,
classification and prediction are supervised learning models, while clustering and
association rules are descriptive models. Classification recognizes patterns that describe
the group to which an item belongs. Prediction is the construction and use of a model to
assess the class of an unlabeled object, or to assess the value or value range that a given
object is likely to have. A supervised learning model classifies data according to
predefined class labels. Unsupervised learning groups data according to the behavior of the
data itself. Unsupervised learning techniques treat all variables in the same way; there is
no distinction between explanatory and dependent variables. However, despite the name
"undirected data mining", there is still some target to achieve. This target might be as
general as data reduction or more specific, like clustering. The difference between
supervised and unsupervised learning is the same as that between discriminant analysis and
cluster analysis. Supervised learning requires that the target variable is well defined and
that a sufficient number of its values are given. For unsupervised learning, typically
either the target variable is unknown or it has been recorded for too small a number of
cases.
1.3 Intrusion Detection in WLAN
A wireless IDPS monitors wireless network traffic and analyzes its wireless networking
protocols to identify suspicious activity performed by users and detected by the protocols
themselves. This section provides a detailed discussion of wireless IDPS technologies.
First, it contains a brief overview of wireless networking, which is background material
for understanding the rest of the section. It covers the major components of wireless
IDPSs and explains the architectures typically used for deploying the components. It also
examines the security capabilities of the technologies in depth, including the
methodologies they use to identify and stop suspicious activity. The rest of the section
discusses the management capabilities of the technologies, including recommendations for
implementation and operation. [10] Wireless intrusion detection systems can be divided into
misuse-based and anomaly-based systems in the same way as IDSs for wired networks. Besides
the classical misuse and anomalies detectable in any network, a wireless IDS must also
detect wireless-specific misuse and anomalies. Machine learning is regarded as an effective
tool used by intrusion detection systems (IDS) to detect abnormal activities in network
traffic. In particular, neural networks, support vector machines (SVM) and decision trees
are three significant and popular schemes borrowed from the machine learning community for
intrusion detection in recent academic research. [7]
1.4 Anomaly Detection
An anomaly is any occurrence or entity that is unusual, abnormal or special. It can also
indicate an inconsistency with, or divergence from, a preset rule or trend. For anomaly
detection, normal behavior is modeled. Any event that contravenes this model is marked as
suspicious. For example, a normally passive public web server can be considered to have a
worm infection if it tries to open connections to a large number of addresses. An
anomaly-based intrusion detection system finds intrusions and misuse in a computer by
monitoring system activity and classifying activities as normal or anomalous. Such a system
can detect any type of misuse that falls outside normal system operation, since the
classification is based on rules or heuristics rather than patterns or signatures. An
anomaly-based detection system looks for deviations from the learned model of normal
behavior. An anomaly-based IDS analyzes ongoing traffic, activity, transactions or behavior
to detect anomalies in the system or network that may indicate an attack. An intrusion
detection system (IDS) is a program that examines what happens, or has happened, during
execution and tries to find indications that the computer has been misused. The development
of anomaly detection techniques suitable for wireless networks is regarded as a vital
research area. [7]
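The idea of modeling normal behavior and flagging deviations can be illustrated with a minimal sketch. It is not the method used in this paper: the feature (connections opened per minute), the sample values and the three-standard-deviation threshold are all assumptions chosen for illustration.

```python
from statistics import mean, stdev

def fit_baseline(values):
    """Learn a very simple model of normal behavior: mean and standard deviation."""
    return mean(values), stdev(values)

def is_anomalous(value, baseline, threshold=3.0):
    """Flag a value lying more than `threshold` standard deviations from the mean."""
    mu, sigma = baseline
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > threshold

# Hypothetical feature: connections opened per minute by one host.
normal_traffic = [4, 5, 6, 5, 4, 6, 5, 5, 4, 6]
baseline = fit_baseline(normal_traffic)

# A host that suddenly opens 120 connections deviates strongly from the model.
flags = [is_anomalous(v, baseline) for v in (5, 7, 120)]
```

Only the last observation is flagged here, which mirrors the web server example above: ordinary fluctuation stays within the learned model, while a burst of outgoing connections does not.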
2. DATA MINING TECHNIQUES FOR ANOMALY DETECTION
In anomaly detection, any significant deviations from expected behavior are reported as
possible attacks. Anomalies are activities of the kind performed by intruders. Anomaly
detection is the process of finding objects that are unlike the other, normal objects. Data
mining provides techniques for finding such groups or classes according to the requirements
of the work. Classification assigns the collected data to predefined classes. Clustering,
another data mining technique, groups the data according to its behavior. These classes or
groups are useful for separating dissimilar groups from one another, according to either
predefined labels or the behavior of the data.
3. PROCESS OF UNSUPERVISED LEARNING (CLUSTERING)
Unsupervised learning is a method of grouping data according to the behavior of the data.
It is also known as the descriptive method. Clustering is one of the unsupervised learning
techniques. Clustering works on the data directly; no predefined labels are required.
Clustering produces as many groups as the user asks for. Clustering techniques form the
groups according to a distance criterion among the data. Different distance measures are
available for computing the distance between instances, and different clustering tools use
different distance measures to group the data. The accuracy of the results depends on the
algorithm used to cluster the instances. This paper shows the use of different data mining
tools. The clustering techniques are applied to the same wireless log to perform a
comparative analysis of which tool gives the more accurate results.
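The distance-based grouping described above can be sketched with a plain k-means loop using Euclidean distance. This is a generic textbook version written for illustration, not the implementation inside any of the tools compared in this paper; the sample points, the random seeding and the iteration limit are assumptions.

```python
import math
import random

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, k, iterations=100, seed=0):
    """Group `points` into `k` clusters; returns (centroids, assignment)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # random instances as initial seeds
    assignment = []
    for _ in range(iterations):
        # Assignment step: each instance joins its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: euclidean(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no instance changed cluster, so the grouping is stable
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignment

# Hypothetical two-attribute instances; the real log has 13 attributes.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
          (5.0, 5.0), (5.1, 4.9), (4.8, 5.2),
          (9.0, 1.0), (9.2, 0.9), (8.9, 1.2)]
centroids, assignment = kmeans(points, k=3)
```

Because the seeds are chosen at random, different seeds (or a different distance measure) can give different groupings of the same data, which is one reason the tools compared here do not agree.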
4. DATA MINING TOOLS USED FOR PERFORMANCE ANALYSIS
Various organizations provide data mining tools for performing data mining techniques.
Some of the tools are freeware and open source, so anyone can easily use them. Data mining
tools provide built-in algorithms for various data mining techniques. In this paper,
different data mining tools are used: Weka, SPSS, Tanagra, and Microsoft SQL Server, which
provides Business Intelligence Development Studio (BIDS) to support data mining
analysis services. In this paper three clusters are generated and defined as “Normal
activities”, “Suspicious activities” and “Anomalous activities”. The clustering algorithm of
each tool is applied to the same wireless log file to find the anomalous group of activities.
Different tools give different results. The important thing is to interpret the results of the
applied techniques. Instances that are close to each other are put into the same cluster, and
closeness is measured by computing the distances between instances. Clusters are generated on
this basis. An unsupervised data mining model is the most suitable here, but different tools
compute distances in different ways, so the choice of an ideal model depends on the accuracy and
error rate given by the algorithm of each tool. The following subsections show the steps to
perform data mining techniques using the different tools.
4.1 WEKA
Weka stands for Waikato Environment for Knowledge Analysis. Weka is an open source tool
written in Java. It provides various data mining techniques. It supports preprocessing tasks,
and the user is able to extend or change the built-in algorithms. Weka works with different file
formats such as .arff, .csv, C4.5 and .xrff. In this paper Weka 3.7 is used to apply SimpleKMeans
with 3 clusters to the wireless log, using Euclidean distance because it is sufficient to group
similar instances.
Figure 1: Clustering using Weka 3.7.
4.2 SPSS
SPSS is a proprietary product from IBM, specially designed to perform statistical analysis.
It provides various statistical tests and also provides data mining techniques. SPSS works with
.sav files and other data files such as Excel. In this paper, SPSS 16.0 is used to apply K-Means
clustering from the Analyze -> Classify menu. This model also generates 3 clusters. SPSS offers
two methods: iterate and classify, and classify only. It also performs ANOVA for statistical
verification.
Figure 2: Clustering using SPSS 16.0
4.3 Tanagra
Tanagra is also a freely available data mining tool. It provides various statistical and
nonparametric tests, supervised learning techniques, association and clustering. Tanagra
works with .arff and the other file formats it specifies. Here Tanagra 1.4.43 is used. It is
a component-based visual tool. It generates 3 clusters for the wireless log. Tanagra
normalizes distances based on variance and selects the seeds either randomly or by its
standard method.
Figure 3: Clustering using Tanagra 1.4.43
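Normalizing by variance matters because attributes measured on large scales would otherwise dominate the distance. The sketch below shows one common way to do this, dividing each centered attribute by its standard deviation. The paper does not state the exact formula Tanagra applies, so this is an assumption, and the two-attribute rows are invented for illustration.

```python
from statistics import mean, pstdev

def standardize(rows):
    """Rescale every attribute (column) to mean 0 and standard deviation 1."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    # A constant column has zero spread; divide by 1.0 to leave it at zero.
    stds = [pstdev(col) or 1.0 for col in columns]
    return [
        tuple((v - m) / s for v, m, s in zip(row, means, stds))
        for row in rows
    ]

# Hypothetical attributes on very different scales,
# e.g. a retry count and a packet size in bytes.
rows = [(1, 100), (2, 200), (3, 300)]
scaled = standardize(rows)
```

After scaling, both attributes contribute equally to a Euclidean distance, whereas on the raw values the second attribute would account for almost all of it.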
4.4 BIDS of MS SQL Server 2008
Microsoft also provides a data mining tool in MS SQL Server 2008, through Business
Intelligence Development Studio (BIDS). This tool provides a set of data mining algorithms
that give scalable results. These algorithms are generally applied to data stored in SQL
Server. In this paper the Microsoft Clustering algorithm is used to generate 3 clusters for
the same wireless log. The tool uses the algorithm as defined by Microsoft. For a given data
log, the user can specify the key, input and predictable attributes along with the number of
clusters. The tool computes the clustering from these settings and, based on statistical
testing, also suggests settings to the user that give a better result.
Figure 4: Clustering using MS SQL Server 2008
5. RESULT INTERPRETATION
Nowadays various organizations provide different tools that support different analytical
techniques, but the most important thing is to interpret the results. In this paper
different tools are applied to the same wireless log but give different results. The three
clusters are categorized as Normal activities, Suspicious activities and Anomalous activities.
5.1 Results using WEKA
Weka runs the simple k-means algorithm to cluster the wireless log. It can perform the
clustering on the loaded data set, or the user can supply a separate test data set. Weka
provides four distance functions for grouping similar instances into clusters. For this log
the Euclidean distance function is used. It generates 3 clusters according to the distance.
As the figure shows, 15% of instances are anomalous activities, 44% are normal activities and
the rest are suspicious activities.
Clustered Instances Result
Cluster Clustered Instances
0 409 ( 41%)
1 440 ( 44%)
2 150 ( 15%)
Figure 5: Results of Weka
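The percentages in the Weka output can be reproduced from the instance counts. The short sketch below uses the three counts exactly as listed in the result above; the mapping of cluster numbers to activity labels follows the percentages quoted in the text and is otherwise an assumption.

```python
# Instance counts per cluster, as listed in the Weka result.
counts = {0: 409, 1: 440, 2: 150}

total = sum(counts.values())  # the three listed counts add up to 999

# Share of each cluster, rounded to whole percent as Weka prints it.
shares = {cluster: round(100 * n / total) for cluster, n in counts.items()}

# Labels inferred from the percentages quoted in the text (assumption).
labels = {0: "Suspicious", 1: "Normal", 2: "Anomalous"}
for cluster, pct in shares.items():
    print(f"cluster {cluster}: {pct}% {labels[cluster]}")
```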
5.2 Results using SPSS
With the settings described above, SPSS performs the iterative classification and assigns
25% of instances to anomalous activities, 25% to suspicious activities and 50% to normal
activities. SPSS is used to perform statistical analysis of the given data log. It shows the
ANOVA table, which represents the normality and significance of the data for the given log.
The results also include the distance matrix of the clusters. It shows that the
distance between cluster one and cluster two is very small compared with their distance to
cluster three. This is interpreted to mean that the instances of cluster three are the most
different from the others. That is, cluster three contains activities whose behavior differs
from normal activities. For that reason cluster three holds the anomalous activities, which
are intrusive, because intrusive events are events that disturb the normal behavior of the
network.
Figure 6: Results of SPSS
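The reasoning applied to the SPSS distance matrix, where the cluster farthest from the others is read as the anomalous one, can be sketched as follows. The centroid coordinates are hypothetical and two-dimensional for readability; they are not the cluster centers reported by SPSS.

```python
import math
from itertools import combinations

def centroid_distances(centroids):
    """Pairwise Euclidean distances between cluster centers."""
    return {
        (i, j): math.dist(centroids[i], centroids[j])
        for i, j in combinations(range(len(centroids)), 2)
    }

def most_isolated(centroids):
    """Index of the cluster whose nearest neighboring center is farthest away."""
    def nearest_neighbor_distance(i):
        return min(
            math.dist(centroids[i], c)
            for j, c in enumerate(centroids)
            if j != i
        )
    return max(range(len(centroids)), key=nearest_neighbor_distance)

# Hypothetical centers: the first two lie close together, the third is far away.
centers = [(1.0, 1.0), (1.5, 1.2), (8.0, 9.0)]
distances = centroid_distances(centers)
outlier_cluster = most_isolated(centers)  # zero-based, so 2 means "cluster three"
```

Distance alone only shows that a cluster behaves differently; labeling it as intrusive still relies on the interpretation given in the text.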
5.3 Results using Tanagra
Clustering is used to generate homogeneous subgroups of instances. In Tanagra the accuracy
of the model is assessed through the TSS (total sum of squares), WSS (within-cluster sum of
squares) and BSS (between-cluster sum of squares). BSS is calculated from TSS and WSS. BSS and
the result ratio are calculated as follows.
BSS = TSS – WSS [34326.92 = 39992.00 - 5665.077]
Result Ratio = BSS / TSS [0.85 = 34326.92 / 39992.00]
The result also shows the classification into individual groups: the numbers of instances
in the 3 clusters do not differ much in ratio.
Figure 7: Results of Tanagra
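The sums of squares behind this ratio can be computed directly from clustered data. The sketch below defines TSS and WSS in the usual way (squared Euclidean distance to the overall centroid and to each cluster's own centroid) and then repeats the arithmetic with the TSS and WSS values reported above. The helper names are hypothetical, and that Tanagra uses exactly these definitions is an assumption.

```python
def centroid(points):
    """Component-wise mean of a list of equal-length tuples."""
    return tuple(sum(dim) / len(points) for dim in zip(*points))

def sum_of_squares(points, center):
    """Sum of squared Euclidean distances from each point to `center`."""
    return sum(sum((x - c) ** 2 for x, c in zip(p, center)) for p in points)

def between_ratio(clusters):
    """Return (BSS, BSS / TSS) for a list of clusters, each a list of points."""
    all_points = [p for cluster in clusters for p in cluster]
    tss = sum_of_squares(all_points, centroid(all_points))
    wss = sum(sum_of_squares(c, centroid(c)) for c in clusters)
    bss = tss - wss
    return bss, bss / tss

# Arithmetic check with the TSS and WSS values reported for the wireless log.
tss, wss = 39992.00, 5665.077
bss = tss - wss    # about 34326.92
ratio = bss / tss  # about 0.858
```

A ratio close to 1 means most of the total variation lies between clusters rather than inside them, so the clusters are compact and well separated.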
5.4 Results using BIDS of MS SQL Server 2008
MS SQL Server 2008 also provides the facility to perform data mining tasks. This tool is
produced by Microsoft. It provides effective mining algorithms. In the results it creates the
clusters automatically according to the behavior of the data. The results also contain the lift
chart and accuracy chart, and display the discrimination statistics. This tool gives the
prediction model together with the results that support it. The lift chart of the model shows the
overall accuracy of the model in terms of statistics, data analysis and model performance. For
this log it shows a linear lift chart with statistical measurements. Across all the results, this
tool gives the most accurate results because it also shows the statistics for the given results,
as shown below.
Figure 8: Results of BIDS
The results show the clustering statistics and the clusters formed according to the
behavior of the data. Each cluster shows its density, given by the instances that fall in
it. The tool also provides statistics on the distance of each instance to its own cluster
and to the others. The clustering in BIDS is more flexible because it supports EM and
K-Means, each in scalable or non-scalable form. The cluster diagram shows the
characteristics of every cluster. The strength of the similarity between clusters is
represented by the shading of the lines that connect them. Light shading denotes that the
clusters are not very similar. In this cluster diagram, clusters eight, nine and ten are
shown with light shading, so their instances are not very similar to the others. The
instances belonging to those clusters therefore represent anomalous activities. Clusters
five, six and seven are shown with average shading, so their instances are interpreted as
suspicious. The remaining clusters are fully shaded, so they contain instances with normal
behavior. The model gives a density of 16%, which is verified by calculating the ratio of
the number of instances in each cluster to the total number of instances in the log. It
therefore gives an ideal model for identifying every instance of the log statistically.
6. CONCLUSION
Recent research suggests data mining techniques for fraud detection and anomaly
detection. The unsupervised learning technique is the most useful for this objective because
it deals with the behavior of complex data. Cluster analysis will always produce a grouping
based on several parameters, some of which the researcher can customize. This paper shows
the use of different tools on the same wireless log and the interpretation of their results.
Among these tools MS SQL Server provides the best model. Some tools have data size
limitations. Some tools are best suited for pure statistical analysis. MS SQL Server has the
limitation that it is not available under the GPL; however, among the tools tested, it is
preferable for dealing with lengthy, complex and dynamic behavioral data.
REFERENCES
1. Marc M. Van Hulle and Jesse Davis, “Data Mining”, Laboratorium voor Neuro- en
Psychofysiologie, Katholieke Universiteit Leuven, pp. 1–54.
2. P. Nancy and R. Geetha Ramani, “A Comparison on Performance of Data Mining
Algorithms in Classification of Social Network Data”, International Journal of
Computer Applications (0975 – 8887), Volume 32, No. 8, October 2011.
3. Glenn A. Growe, “Comparing Algorithms and Clustering Data: Components of
the Data Mining Process”, Thesis, Grand Valley State University, 1999.
4. Matthew S. Gast, “802.11 Wireless Networks: The Definitive Guide”, O’Reilly,
ISBN 0-596-00183-5.
5. Thuy Van T. Duong and Dinh Que Tran, “An Effective Approach for Mobility Prediction
in Wireless Network based on Temporal Weighted Mobility Rule”, International Journal
of Computer Science and Telecommunications, Volume 3, Issue 2, February 2012,
ISSN 2047-3338.
6. Mohamed Medhat Gaber, Shonali Krishnaswamy and Arkady Zaslavsky, “A Wireless
Data Stream Mining Model”, ICEIS.
7. M. Moorthy and S. Sathiyabama, “A Hybrid Data Mining based Intrusion Detection System
for Wireless Local Area Networks”, International Journal of Computer Applications
(0975 – 8887), Volume 49, No. 10, July 2012.
8. Balaji Rengarajan and Gustavo de Veciana, “Data Mining and Coordination to Avoid
Interference in Wireless Networks”, supported by Intel Research Council and NSF
Award CNS-0721532.
9. Jeffrey W. Seifert, “Data Mining: An Overview”, CRS Report for Congress.
10. Karen Scarfone and Peter Mell, “Guide to Intrusion Detection and Prevention Systems
(IDPS)”, NIST Special Publication 800-94.
11. Theodoros Lappas and Konstantinos Pelechrinis, “Data Mining Techniques for (Network)
Intrusion Detection Systems”.
12. R. Manickam, D. Boominath and V. Bhuvaneswari, “An Analysis of Data Mining: Past,
Present and Future”, International Journal of Computer Engineering & Technology
(IJCET), Volume 3, Issue 1, 2012, pp. 1 - 9, ISSN Print: 0976 – 6367,
ISSN Online: 0976 – 6375.
13. M. Karthikeyan, M. Suriya Kumar and S. Karthikeyan, “A Literature Review
on the Data Mining and Information Security”, International Journal of Computer
Engineering & Technology (IJCET), Volume 3, Issue 1, 2012, pp. 141 - 146,
ISSN Print: 0976 – 6367, ISSN Online: 0976 – 6375.
14. R. Lakshman Naik, D. Ramesh and B. Manjula, “Instances Selection Using Advance
Data Mining Techniques”, International Journal of Computer Engineering & Technology
(IJCET), Volume 3, Issue 2, 2012, pp. 47 - 53, ISSN Print: 0976 – 6367,
ISSN Online: 0976 – 6375.
AUTHORS
A. Mr. Ajay M. Patel is an assistant professor in the faculty of computer application at
Ganpat University, India. He is interested in the networking area. He has also worked with
data mining and has gained expertise in data mining for wireless networks. His ongoing
research focuses on intrusion detection in wireless LANs. He has published a number of
journal and conference papers in his areas of research interest. He is currently working on
pattern matching and prediction of wireless network traffic.
B. Dr. Ashok R. Patel is an eminent personality interested in finding ways to improve the
teaching and learning process. He has extensive research experience in e-commerce and
e-governance. He has guided more than 15 Ph.D. students as well as postgraduate students in
diverse fields of computer application such as data mining, neural networks, computer
networks and enterprise resource planning. He is a director of the department of computer
science at H. North Gujarat University, India. He is also working as a director in AICTE,
the apex body in India for technical education.
C. Ms. Hiral R. Patel is an assistant professor in the faculty of computer application at
Ganpat University, India. She is beginning work on pattern matching and prediction of
financial data and wireless network traffic.