CHAPTER 1
1.1 AIM:
Optimization And Combination Of K Nearest Neighbours For Intrusion Detection System
through Genetic Programming using KDD CUP 1999 dataset.
1.2 OBJECTIVE:
 Can a GP-based numeric classifier show better performance than the individual
NN classifiers?
 Can a GP-based combination technique produce a higher-performance OCC as
compared to the NN component classifiers?
 Can a heterogeneous combination of NN classifiers produce more promising
results than a homogeneous one?
1.3 SCOPE:
The scope of this project is wide. It can be used not only for safeguarding network
systems but also for detecting fraud in banks and other e-commerce businesses. The
system proposed in this work can also be applied to other problems by employing
different datasets, such as health data, iris data and so on.
1.4 STRUCTURE:
In the following, a description is given of how the dissertation is organised so as to
explain the idea of genetic programming and how it can be used for the optimization
and combination of K nearest neighbours for an intrusion detection system.
The first chapter covers the aim, scope and methodology. Chapter two
introduces genetic programming. In chapter three we discuss the
literature survey. Chapter four gives the methodologies used, and the fifth chapter covers
the conclusion and references.
1.5 Methodology:
In this section we discuss the various steps that will be taken in the development of our
project. Development will take place in two phases. In the first phase we focus on the
development of the optimal KNN classifier using the fitness of individuals. In the second
phase we focus on the combination of the optimal KNN classifiers on the basis of the ROC
curves obtained.
Fig 1.1 Gantt Chart:
CHAPTER 2
2.1 Abstract:
Since the advent of networked systems, security threats commonly known as intrusions
have become a vital and critical issue for networks, data and information
systems. To overcome these threats, a detection system has always been
needed, because of the drastic growth of networks. As networks grew,
attackers became stronger and repeatedly compromised the
security of systems. Hence an Intrusion Detection System has become an
essential tool in network security. Detection and prevention of such
attacks, called intrusions, mainly depends on the capability and efficiency of the Intrusion
Detection System (IDS). With the rise in network scalability and speed,
a lightweight Intrusion Detection System with a high
detection ratio is a necessity. Therefore many ensemble mechanisms have been proposed
using various methodologies; these methodologies have their own benefits and
shortcomings. In this paper we focus on ensembles of classifiers using genetic
programming. We discuss how genetic programming is a good way of combining
component classifiers, and then how to achieve the highest performance
measures.
2.2 INTRODUCTION:
With the advancement of technology and the rapid growth of networks, there is a need to
extract useful information from large volumes of data. In fields such as commerce, health,
engineering and others there is a need to take data from various input
sources and then classify it precisely into the corresponding classes. In medical,
industrial, scientific and business applications the data generated is becoming extremely
complex and difficult for humans to make sense of. Many approaches have been given by
scientists so that computers not only produce data but also help us
understand it. The most common methods for extracting useful information from large
data are data mining and knowledge discovery. The main target of practitioners is to design a
high-performance prediction model. Many methods have been proposed for data
classification. Statistical methods have proved to be useful tools for analysing
classification problems. Moreover, a collection of other methods has been proposed,
such as expert systems, artificial neural networks, fuzzy logic, decision trees and evolutionary
computing. These are more competent in many application areas as compared to statistical
techniques. These intelligent methods are good for decision making, offer better performance
and can be easily tuned to meet the requirements of decisions. These intelligent
methods can be more precise than human experts.
A lot of effort has gone into applying intelligent techniques. Usually in classification
we do not have any prior knowledge about the input data or about the form of the
final prediction model. These issues pose problems for intelligent methods
that could otherwise optimize the performance of many classification models.
Genetic programming methods have been applied in a growing number of pattern
classification applications. In object detection, segmentation, feature extraction and also
classification, GP has proved effective. The key to the success of intelligent
methods is their optimization, and there is a lack of complete methods for optimizing
classification models. GP models are more flexible and comparatively more general, and
they provide a high learning capability. These features of GP help in building classification
models. GP successfully optimizes various classification models.
Genetic Programming is a relatively new branch of evolutionary computation. Because it
represents the solution to a problem in the flexible form of a computer program, it can
provide an efficient solution to a wide range of problems. GP-based search offers new
optimization methods in classification. Various methods such as decision tree classifiers and
classification rule sets can be developed using GP. GP also gives a new form of classifier
called the numeric classifier; these classifiers contain arithmetic and logical operators and
can be combined with conditional statements. GP-based numeric classifiers have been
applied to various problems such as disease diagnosis, intrusion detection, pattern matching,
business applications, etc.
GP is an easy, flexible and powerful technique. It has been widely used for optimization
problems, e.g. model induction and automatic programming.
Various classification models have been developed for specific problems. Due to the low
performance of these problem-specific models, a GP-based ensemble method is adopted in
this paper to enhance the performance of various classification models.
In many applications, individual classifiers are combined to obtain the maximum benefit. A
combination of two or more models is more competent than any individual one. The
shortcoming of one classifier can easily be compensated by the strength of another. The
individual classifiers and their diversity help in developing an optimal composite classifier.
Several component classifiers are created from learning algorithms. The optimal classifier is
obtained by combining the outputs of the component classifiers through a combination scheme.
The two issues regarding these methodologies are as under:
1. How to create suitable individual classifiers.
2. The best way to combine them, i.e. which combination scheme to use.
These issues will be addressed in this paper.
GP addresses these issues by producing functions through an evolutionary process. A
heuristic search method may not find an optimal solution due to the complexity of searching
in a huge space; GP offers a good solution to this problem.
GP provides a better combination of classifiers, known as numeric classifiers. In this
paper, nearest neighbour classifiers are combined to develop an optimal classifier for an
intrusion detection system.
GP-based methods can efficiently combine the outputs obtained from individual
classifiers to develop an optimal numeric classifier. The method manipulates a population of
candidate solutions. The main focus of GP is the search space that contains the best optimal
solution. GP uses its inherent qualities of adaptation, flexibility and generalization.
Adaptation helps in monitoring the classifier combination and refining its performance as
conditions change. Flexibility helps it work robustly against incomplete or inconsistent
data. Generalization helps in developing new GP-based classifiers that make competent
decisions on unseen examples for which no ground truth is available.
The Receiver Operating Characteristic (ROC) curve is used for computing the performance
of a classifier. A classifier may not perform well because of overlapping class distributions;
for this reason the AUCH (area under the convex hull) of the ROC curve is used to obtain the
optimal numeric classifier. In this paper the ROC curve is used as the GP fitness function.
The ROC curve is also used for comparing classifiers: a larger ROC curve for the composite
classifier leads to better decisions.
2.3 Numeric expression classifier:
A numeric expression is any function that returns a numeric value. The function can
contain any mathematical operator, i.e. arithmetic, logarithmic and other
operators. The classifier takes the problem variables as input and produces an output in the
form of a numeric value.
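As an illustration only, a minimal Python sketch of such a numeric expression classifier is
given below; the particular expression and the features x[0]..x[3] are hypothetical, and a
threshold simply maps the numeric output to a class:

import math

def numeric_expression(x):
    """A hypothetical GP-evolved numeric expression over the feature vector x.

    It mixes arithmetic and logarithmic operators, as described above, and
    returns a single numeric value.
    """
    return 2.0 * x[0] - math.log(1.0 + abs(x[1])) + x[2] * x[3]

def classify(x, threshold=0.0):
    """Map the numeric output to a class label by thresholding."""
    return "attack" if numeric_expression(x) >= threshold else "normal"

print(classify([0.4, 1.2, 0.9, 0.1]))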
2.4 GP OPERATORS:
Crossover: the crossover operator is the most commonly used operator. It merges the genetic
material from two parent trees and produces two offspring trees. In each parent a point is
randomly selected as the crossover point.
Mutation: it is used to make a random change to a learning tree. An individual is selected for
mutation using fitness-proportional selection.
Reproduction: a copy of an individual is simply made and passed into the next generation; it
makes no change to the program.
Fig 2.1 Block diagram of the genetic programming process.
The flowchart in Fig 2.1 shows the standard GP loop: generate an initial population of M
individuals, evaluate the fitness of each individual and, until the termination criteria are
satisfied, probabilistically select a genetic operation (reproduction, crossover or mutation),
apply it to individuals selected on the basis of fitness, insert the results into the new
population, and finally save the best result.
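A minimal Python sketch of the three operators acting on expression trees is given below.
It is illustrative only: the function set, terminal set and tree representation are assumptions,
not those used in this project.

import random

# A GP individual is a nested tuple: ('op', left, right) or a terminal ('x0', 'x1', or a constant).
FUNCS = ['+', '-', '*']
TERMS = ['x0', 'x1', 1.0, 2.0]

def random_tree(depth=3):
    if depth == 0 or random.random() < 0.3:
        return random.choice(TERMS)
    return (random.choice(FUNCS), random_tree(depth - 1), random_tree(depth - 1))

def nodes(tree, path=()):
    """Enumerate (path, subtree) pairs so a random node can be picked."""
    yield path, tree
    if isinstance(tree, tuple):
        yield from nodes(tree[1], path + (1,))
        yield from nodes(tree[2], path + (2,))

def replace(tree, path, new):
    if not path:
        return new
    op, left, right = tree
    if path[0] == 1:
        return (op, replace(left, path[1:], new), right)
    return (op, left, replace(right, path[1:], new))

def crossover(p1, p2):
    """Swap a randomly chosen subtree of p1 with a random subtree of p2 (one offspring shown)."""
    path1, _ = random.choice(list(nodes(p1)))
    _, sub2 = random.choice(list(nodes(p2)))
    return replace(p1, path1, sub2)

def mutation(p):
    """Replace a randomly chosen subtree with a fresh random subtree."""
    path, _ = random.choice(list(nodes(p)))
    return replace(p, path, random_tree(depth=2))

def reproduction(p):
    """Copy the individual unchanged into the next generation."""
    return p

parent_a, parent_b = random_tree(), random_tree()
print(crossover(parent_a, parent_b))
print(mutation(parent_a))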
2.5 Genetic Programming For combining Classifiers:
The main idea behind combining classifiers is to obtain higher accuracy than the individual
classifiers.
The data mining and knowledge discovery literature suggests that combining individual
classifiers into a meta-classifier will frequently enhance accuracy. There are many such
combination methods that merge several classifiers to produce a single classifier of higher
performance.
The unstable nature of a component classifier makes it suitable for an ensemble. Unstable
means a learning algorithm whose output changes in response to a small change in the
training sample.
Hansen and Salamon argued that a combination of classifiers is more accurate if and only
if the individual classifiers are accurate and diverse. In their work they proved that if the
component classifiers are independent and their error rates are less than 0.5, then the error
rate of the combined classifier decreases as the number of individual classifiers increases.
From this we understand that the accuracy and diversity of the individual classifiers are of
great importance.
In the literature a large number of combinations have been proposed. The combinations are
divided into serial and parallel. Parallel combination is the most widely used. It consists of a
set of component classifiers and a combining algorithm; the combining algorithm combines
the outputs of the individual classifiers to make a final classifier, as shown in Fig 2.2.
Fig 2.2 The most commonly used parallel combination of classifiers (an input pattern x is fed
to classifiers C1, ..., Ck, whose outputs are combined to produce the class).
There are two methods for combining component classifiers:
1. Generative: first create the base classifiers and then ensemble them. Boosting,
bagging and mixture of experts are some generative techniques.
2. Non-generative: usually combine independently trained classifiers.
The combination of classifiers depends on the output supplied by the component classifiers.
Our GP-based technique will be non-generative and will combine information at the
measurement level.
2.6 Categorization of combination of classifiers:
Dietterich described many combination methods based on machine learning.
Sharkey pointed out that the limiting factor for combining classifiers is the lack of awareness
of the full range of available modular structures, because there is little agreement on how to
describe and classify the various classes of classifier ensembles. A comprehensive
categorization scheme for classifier ensembles is shown below.
1. Voting classifier ensembles
2. Classifier ensembles by manipulating training samples
3. Homogeneous classifier ensembles
4. Recursive partition ensembles
5. Heterogeneous classifier ensembles
2.6.1 Voting classifier ensembles: the three main categories are as follows:
1. A simple voting scheme: each individual classifier casts an equally
weighted vote. The input is assigned to the class receiving the majority of votes.
2. A weighted voting scheme: each vote receives a weight proportional to the
estimated generalization performance of the corresponding classifier (a sketch of this
scheme follows this list). It usually performs better than simple voting.
3. The weighted majority algorithm: similar to weighted voting, but the difference is in
how the weights are generated.
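A small illustrative Python sketch of the weighted voting scheme is given below; the example
weights are hypothetical validation accuracies:

import numpy as np

def weighted_vote(predictions, weights):
    """Combine binary class predictions (0 = normal, 1 = attack) from several
    classifiers, with weights proportional to each classifier's estimated
    generalization performance."""
    predictions = np.asarray(predictions, dtype=float)
    weights = np.asarray(weights, dtype=float)
    score = np.dot(weights, predictions) / weights.sum()
    return int(score >= 0.5)

# Hypothetical example: three classifiers, validation accuracies used as weights.
print(weighted_vote(predictions=[1, 0, 1], weights=[0.92, 0.70, 0.85]))  # -> 1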
2.7 Classifier ensembles by manipulating training samples:
In such approaches, the learning algorithm is run many times, each time with a different
partition of the training sample. Boosting and bagging are the two most successful.
Bagging: a simple method for building an ensemble of classifiers, abbreviated from bootstrap
aggregating. In this method, different datasets are first created, each one sampled randomly
with replacement from the training data. A machine learning method is used to train a
classifier on each of these training sets. These classifiers are then applied to the test examples
using a voting scheme. The machine learning method produces a different classifier for each
dataset, and these classifiers are combined (a sketch follows below).
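The following Python sketch illustrates the bagging idea under the assumption that decision
trees (scikit-learn's DecisionTreeClassifier) are used as the base learner; the data is a toy
example:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, n_estimators=10, random_state=0):
    """Train one classifier per bootstrap sample of the training data."""
    rng = np.random.default_rng(random_state)
    models = []
    for _ in range(n_estimators):
        idx = rng.integers(0, len(X), size=len(X))   # sample with replacement
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Combine the component classifiers by simple majority voting."""
    votes = np.stack([m.predict(X) for m in models])
    return (votes.mean(axis=0) >= 0.5).astype(int)

# Hypothetical toy data: 4 features per connection, label 1 = attack, 0 = normal.
X = np.random.rand(200, 4)
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)
models = bagging_fit(X, y)
print(bagging_predict(models, X[:5]))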
Boosting: proposed by Schapire. In this algorithm any weak learning algorithm can be
boosted into a strong one, based on a theoretical model called the weak learning model
(PAC). PAC attempts to combine classifiers trained for a given problem into a more general
one. In this method the whole dataset is used for deriving a classifier.
Homogeneous classifier ensemble: in this ensemble technique only one specific type of
classifier is combined in the classifier ensemble.
Recursive partition classifier ensemble: a divide-and-conquer strategy is used to partition
the space into subsets, or instances, of only one class. This model can be used to combine
decision trees, linear discriminant functions and instance-based ensembles.
Heterogeneous classifier ensemble: meta-learning and stacked generalization are the two
most used methods. In this approach, different types of classifiers are ensembled for higher
accuracy.
2.8 GP in Combining Classifiers:
1. Generating component classifiers: a lot of research has been carried out to
exploit the potential strength of GP for improving classification results. The
classifiers are generated by bagging and boosting techniques. These methods can be
used for training several classifiers on different samples of the training data. The
trained classifiers are then combined to form a single classifier that can improve the
GP results.
2. Generating decision trees: to build the decision tree structure, at each inner
node the decision conditions used by decision tree builders were replaced
by GP-trained numeric expressions. A simple, relatively inaccurate expression is
trained at the root node. Each data example is passed to one of the two child
branches, depending on the classification from the root node. At each node a GP-
evolved expression is trained to classify the data again. This method of
combining the decision tree structure with GP expressions gives higher accuracy.
3. Combining classifiers: Langdon et al. have contributed a great deal to the
development of composite classifiers using GP. Their main motive was to trade off
FPR against TPR to produce a highly optimized ROC curve. Various classifiers such
as neural networks, decision trees and Naive Bayes were combined.
4. Architecture of the GP-based combination technique:
a. The first layer is the same as that of stacked generalization.
b. The outputs of all individual classifiers are combined to form a new derived
training dataset. Composite classifiers are developed using genetic programming.
c. The GP-based combined classifiers use the threshold T as a variable for computing
the AUCH of the ROC curve.
Fig 2.3 The architecture of the construction of a composite classifier (a set of suitable
component classifiers whose predictions, together with the threshold t in [0,1] and other
random variables, feed the GP simulation cycle to produce the OCC as a function of
C1(t), C2(t), ..., Cn(t)).
2.9 Computing the Prediction of Component Classifier:
Two general approaches exist for prediction computation and they are as under:
1. Recursively partitioning the input data space: the input data is recursively partitioned
into subspaces by the component classifiers. The subspaces in the final partition are
used as the predictions for all instances in that subspace. These algorithms are applied
through decision trees and use a divide-and-conquer approach.
2. Global data scope: the idea of a global data scope is used. A prediction is made by
each component classifier on an input instance and the predictions are then combined
by a combining scheme. Each classifier is tested on all instances. The stacked
generalization method is adopted here.
2.10 GP Based Learning Algorithm:
For developing composite classifiers, two phases are required: a training phase and a
classification phase. Pseudo code for these phases is given below:
Training Pseudo Code:
St, Stst: the training and testing data
C(x): class of instance x
OCC: a composite classifier
Ck: kth component classifier
Ck(x): prediction of Ck
Train-Composite-Classifier(St, OCC):
1. All input data x ∈ St is given to the k component classifiers.
2. Collect [C1(x), C2(x), C3(x), ..., Ck(x)] for all x ∈ St to form a prediction vector.
3. Combine using the GP method, taking T as the threshold to compute the AUCH of the
ROC curve; each prediction is used as a unary function in the GP tree.
Pseudo Code for Classification:
1. Apply the composite classifier to a data sample x taken from Stst.
2. Stack the predictions to form the new derived prediction vector
X = [C1(x), C2(x), C3(x), ..., Ck(x)].
3. Compute OCC(X).
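As an illustration of the two phases, a hedged Python sketch is given below. KNN classifiers
with different k stand in for the component classifiers C1...Ck, and the combiner function is a
simple placeholder for the GP-evolved combination; both choices are assumptions for
demonstration only:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def train_component_classifiers(X_train, y_train, ks=(1, 3, 5)):
    """Train k component classifiers C_1 ... C_k (here, KNN with different k)."""
    return [KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train) for k in ks]

def prediction_vector(classifiers, x):
    """Collect [C_1(x), C_2(x), ..., C_k(x)] as the derived prediction vector
    that the combining function (OCC) works on."""
    x = np.asarray(x).reshape(1, -1)
    return np.array([clf.predict_proba(x)[0, 1] for clf in classifiers])

def occ_predict(classifiers, x, combiner, threshold=0.5):
    """Stack the component predictions, apply the (hypothetical) GP-evolved
    combination function, and threshold the result with T."""
    p = prediction_vector(classifiers, x)
    return int(combiner(p) >= threshold)

# Hypothetical stand-in for a GP-evolved combination function.
combiner = lambda p: p.mean()

X = np.random.rand(300, 5)
y = (X[:, 0] > 0.5).astype(int)
classifiers = train_component_classifiers(X, y)
print(occ_predict(classifiers, X[0], combiner))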
CHAPTER 3
LITERATURE SURVEY:
(Michal Wozniak, Manuel Grana and Emilio Corchado, 29 April 2013). The paper presents a
survey of up-to-date multiple classifier systems built as hybrid intelligent systems. The
major issues discussed in the paper are diversity and the methods for decision
fusion.
The system topologies used to design an MCS in this paper are as under:
1. Parallel topologies
2. Serial topologies based on the AdaBoost algorithm.
The paper addresses two issues, ensemble design and fusion design. In ensemble design
they include mutually complementary individual classifiers on the basis of high diversity and
accuracy. The fuser design is depicted in Fig 3.1.
Fig 3.1 Architecture of an MCS that makes its decision on the basis of class-label fusion only
(classifiers 1..N feed a fusion/combination rule that produces the decision).
Fig 3.2 Architecture of an MCS that computes the decision on the basis of support-function
combination.
The MCS is not the only option for hybridization. The other possibilities are as under:
1. Merging raw data from different sources, collected and stored in one repository for
classifier training.
2. Merging prior expert knowledge with raw data from different sources.
3. Merging prior knowledge with models produced by machine learning procedures.
To design such systems the main points for consideration are data privacy, computation
and memory efficiency.
(In the International Journal of Pattern Recognition and Artificial Intelligence) The paper
puts forward an SVM ensemble based on the Choquet integral. The aim of this ensemble
model is to predict financial distress using the bagging algorithm. The proposed ensemble can
be expressed as "Choquet + SVMs + Bagging". The Choquet-integral ensemble has higher
average accuracy and stability than a single SVM classifier.
(Durga Prasad Muni, Nikhil R. Pal, Senior Member IEEE, and Jyotirmoy Das) A new
approach for designing classifiers using genetic programming was proposed. In the paper an
integrated view of all classes is taken when the genetic programming starts its evolution.
In the paper, modified mutation and crossover operations were proposed to reduce the
destructive nature of genetic operations.
A new concept of the unfitness of a tree was used to select a tree for the genetic operations.
The intention of this unfitness is to give an unfit tree a chance to become fit. In
the terminal nodes a new concept of OR-ing was introduced, which yields a classifier with
better performance. For conflict resolution, heuristic rules that characterise unclear situations
and a weight-based scheme were used. The resulting classifier is able to say "I don't
know" when it faces situations that are outside its knowledge domain. The effectiveness of
the approach was demonstrated on several real datasets. A single run of GP is used to design
a classifier for pattern classification. A typical genetic programming implementation
involves the following steps:
1. It begins with the random generation of a population of solutions of size N.
2. Then each solution in the population is assigned a fitness value.
3. Probabilistic selection of a genetic operator.
The various datasets used in the paper are:
A. IRIS dataset
B. Wisconsin Breast Cancer
C. BUPA Liver Disorders
D. Vehicle Data
E. RS-Data
The GP approach proposed in this system for designing a classifier requires only a single GP
run for optimal classifier evaluation.
The various contributions of the paper are as under:
1. An approach for designing classifiers for multi-category problems using the concept
of a multitree in Genetic Programming.
2. Tree selection for operations like crossover or mutation depends on unfitness.
3. A modified crossover operation.
4. A non-destructive directed point mutation operator, known as the modified or new
mutation operator.
5. An OR-ing operation to optimize the classifier.
6. A weight-based scheme for conflict resolution.
7. A modified Kishore et al. heuristic rule.
The GP-based classifier was tested on various datasets. The results obtained were
satisfactory.
Limitations:
1. Size of tree
2. Simultaneous feature analysis with classifier design.
(Niusvel Acosta-Mendoza, Alicia Morales-Reyes, Hugo Jair Escalante and Andres Gago-
Alonso, 2014). The paper uses a novel genetic programming approach for
constructing heterogeneous ensembles. Ensemble learning is an approach that aims at
combining the outputs of several individual classifiers to improve performance. The outputs
of the classifiers are typically combined by majority voting or a weighted sum. The paper
uses a GP-based approach to learn the fusion functions that are responsible for combining the
ensemble classifiers' outputs. The main focus of the paper is on ensembles of heterogeneous
classifiers, where the individual classifiers are based on different principles. The results show
that the proposed method is extremely successful at building highly
competent models. The proposed method can also be used for combining
homogeneous classifiers.
(K. M. Faraoun, A. Boukelif). The paper presents a new genetic programming approach
to classification. The method genetically co-evolves a population of non-linear
transformations of the data to be classified and then maps the data to a new
subspace of reduced dimensionality so as to obtain higher inter-class discrimination. It is
easy to classify new examples from the transformed data. The method uses a
dynamic repartition of the transformed data using discrete intervals; efficiency is achieved
through a fitness criterion with a higher class discrimination.
It is benchmarked on two datasets, the Fisher IRIS and the MIT KDD CUP 99. Fisher
IRIS is used for comparison and to illustrate the method's capabilities. The MIT KDD CUP
99 is used for intrusion detection. The performance rates are as under:
DR = 0.980 (98%)
FP = 7E-4 (0.07%)
Classification rate = 99.05%
The technique is independent of the dataset and of the GP structure employed.
(Gianluigi Folino, Giandomenico Spezzano and Clara Pizzuti). The paper presents a genetic
programming approach for data classification that induces an ensemble of predictors.
Individual classifiers are trained on different subsets of the overall data, and then a majority
voting algorithm, e.g. bagging, is used to combine the classifiers. The results of this approach
on a large dataset showed that the ensemble of different classifiers, each trained on a
sample of the data, achieves higher accuracy than a single classifier. The single
classifier also has a higher computational cost than the ensemble classifier.
1. The main feature of the proposed model is that each sub-population generates a
classifier which works on a sample of the training data instead of using the whole
training set.
2. The approach, CGCP, is able to cope with large datasets which do not fit in
main memory.
3. The various experiments on large real data showed that high accuracy can be
obtained by using a reasonable sample size at a lower computational cost.
(Giandomenico Spezzano, Gianluigi Folino and Clara Pizzuti.) An intrusion detection
system based on GP is proposed. The GP algorithm is designed for distributed
networks, for monitoring security-related activities that occur inside a network. Each
node of the network contains a cellular program based on genetic programming; the main
aim of the program is to produce a decision-tree predictor. The program is trained on the
local data stored in the node. The cellular genetic programs work cooperatively but
independently. The program takes advantage of the cellular model for exchanging the
outermost individuals. This helps in the computation of the classifiers. After the classifiers
are computed, they are combined to form the GP-based ensemble. The dataset used in this
approach is KDD CUP 99. Table 3.1 shows the confusion matrix of the proposed system:
Normal Probe Dos U2R R2L
Normal 60250.8 200.2 110.8 15.4 15.8
Probe 832.8 2998.4 263.6 26.4 44.8
Dos 7464.2 465.0 221874.8 19.2 29.8
U2R 139.6 45.2 17.2 11.8 14.2
R2L 15151.4 48.6 232.4 173.8 582.8
The paper explained that we can think of an IDS as a number of functional entities, and these
entities are encapsulated to form an autonomous agent. The paper also explained that genetic
programming can be used as a good learning paradigm for training agents that can
potentially detect intrusive actions.
(Urvesh Bhowan, Mark Johnston, Member IEEE, Mengjie Zhang, Senior Member IEEE, Xin
Yao, Fellow IEEE) The paper addresses the performance bias suffered by machine learning
algorithms because of unbalanced datasets. An unbalanced dataset means one class is
represented by a large number of training examples, called the majority class, while the
other, known as the minority class, contains a small number of training examples. In the
unbalanced-dataset scenario, the majority class has good accuracy compared to the minority
one. This problem is addressed in the paper, which proposes a multi-objective genetic
programming approach that evolves accurate and diverse ensembles of genetic program
classifiers. These classifiers perform well on both classes. The paper proposes a framework
that evaluates the effectiveness of two popular Pareto-based fitness strategies (SPEA2,
NSGA-II). The methods are investigated for encouraging diversity among the solutions in the
evolved ensemble. The paper shows that the ensembles outperform their individual members.
(Detecting New Forms of Intrusion Using Genetic Programming.) The paper presents a rule-
evolution approach based on genetic programming for detecting novel attacks on
networks, and four genetic operators are presented. The four operators used to evolve new
rules are reproduction, mutation, crossover and dropping condition. These new rules are
used to detect novel or known attacks. The DARPA training and testing dataset is used
to evolve these new rules. The rules generated by genetic programming have a low false
positive rate, a low false negative rate and a high rate of detecting unknown attacks.
The new rules have a high detection rate with a low false alarm rate.
(A Genetic Algorithm based Network Intrusion Detection System). A machine learning
algorithm was used to identify the kind of connection, i.e. whether attack or
normal. The GA takes into consideration various features of the connection, such as the type
of protocol, the network service on the destination and the status of the connection, for
producing a classification rule set. The KDD CUP 1999 dataset was used to produce a rule
set that can be applied to identify different classes of network attack connections. In this
paper a rule set was developed that covers six different kinds of attack. The attacks
fall into two classes, namely DoS and Probing. The rules generated have 100% accuracy
for detecting the DoS attacks and appreciable accuracy for detecting the Probe kind of
attacks. The results of this experiment are encouraging.
(Intrusion detection using an Error Correcting Output Code based ensemble). The
problem tackled in this paper concerns class imbalance, raising the detection rate for each
class and minimizing false alarms in intrusion detection. The paper presents an experimental
study of seven classifiers using bagging and AdaBoost ensemble methods. A new
hybrid ensemble based on the Error Correcting Output Code approach was designed.
This approach is based on multiclass-to-binary classification methods. The seven classifiers
used in the experimental investigation are Naive Bayes, Multi-Layer Perceptron, Support
Vector Machine, Radial Basis Function Neural Network, J48, Random Tree and Random
Forest. The new approach given in the paper enhances the accuracy (99.7%). It also increases
detection rates and reduces false alarms even for the minority classes.
(Rohan D. Kulkarni) The KDD CUP 1999 dataset is used for intrusion detection and pattern
matching. A lot of experiments have been carried out on this dataset and many researchers
have analysed it. The paper contains results obtained after classifying 10% of the KDD CUP
dataset using ensemble methods like bagging, boosting and AdaBoost, and compares their
performance with the standard J48 algorithm.
(Roshni Dubey and Pradeep Nandan Pathak). This journal paper proposes a hybrid design for
intrusion detection that merges anomaly detection with misuse detection. The
method proposed includes an ensemble feature-selecting classifier and a data
mining classifier. The former consists of four classifiers using different sets of
features, each employing a machine learning algorithm called the fuzzy belief
K-NN classification algorithm. The latter uses data mining techniques to extract
computer users' normal actions from network traffic data. The paper then ensembles
the outputs of the feature-selecting classifier and the data mining classifier, which are fused
to obtain the final decision. The experimental results given in the paper show that the
hybrid approach efficiently generates a more precise intrusion detection model for
detecting both normal and malicious activities.
(M. R. Moosavi, M. Zolghadri, 2012). In this paper a novel cost-sensitive learning
algorithm is proposed to enhance the performance of the nearest neighbour rule for intrusion
detection. The paper aims to minimize the overall cost in a leave-one-out classification of
the given training set, since intrusion detection is a problem in which the costs of different
misclassifications are not the same. The distance function is described in a parametric form
to optimize the nearest neighbour rule for intrusion detection. The proposed feature-
weighting and instance-weighting algorithms are used to adjust the free parameters of the
distance function. The feature-weighting algorithm can be viewed as a general-purpose
wrapper method for feature weighting. The instance-weighting algorithm is used to remove
noisy and redundant instances from the training set. This enhances the speed and
performance in the generalization phase. The paper uses the KDD CUP 1999 dataset. On
this dataset the scheme succeeds in reducing the average cost of classification on
previously unseen data. The scheme also removes the redundant features and
instances by setting their weights to zero.
(Mark Crosbie, Eugene H. Spafford). The paper presents a solution to problems that arise in
intrusion detection with respect to computer security. The model merges artificial life
and computer security. The paper uses autonomous agents for the implementation of the
intrusion detection system. The system uses automatically defined functions for evolving
genetic programs containing multiple data types and ensures type safety.
(Amit Kumar, Harish Chandra Maurya, Rahul Misra). In this paper an IDS consists of four
components according to the CISF framework: event generators, analysers, event
databases and response units. In the paper a dataset is used to provide attack and normal data
to the analyser. The best machine learning algorithm is used, which enhances the detection
rate of alerts. The data centre both trains and assesses the performance of the analyser and
improves its prediction.
(Niusvel Acosta-Mendoza and Hugo Jair Escalante, 2012). The paper gives a novel approach
for constructing ensembles based on genetic programming. In the paper a GP-based
approach is used to learn combination functions that merge the outputs of the individual
classifiers. A systematic empirical evaluation is carried out to validate the effectiveness of
the proposed approach.
(Vipin Das, Vijaya Pathak, Sattvik Sharma, 2014). In the paper, Rough Set Theory and
support vector machines are used to detect intrusions. RST is used to pre-process the
captured data and reduce its dimensionality. The pre-processed data is sent to an SVM model
for learning and testing respectively. This method reduces the space density of the data.
(Carlotta Domeniconi and Bojun Yan). In the paper, the instability of the KNN classifier
with respect to different choices of features is exploited to produce diverse NN classifiers
with uncorrelated errors. The paper takes advantage of the high dimensionality of the data.
The results show that the method offers performance improvements.
(Yan-Nei Law and Carlo Zaniolo). An incremental classification algorithm is proposed in
this paper. The algorithm uses the idea of a multi-resolution data representation and finds an
adaptive nearest neighbourhood of a test point. The incremental algorithm achieves good
performance by using a small ensemble of classifiers. The classifiers guarantee error
bounds for each ensemble size. The classifier is extremely suitable for data stream
applications. The experiments on synthetic and real-life data indicate that the
proposed algorithm outperforms the existing ones in terms of accuracy and computational
cost.
(Prof. Dighe Mohit S., Kharde Gayatri B., Mahadik Vrushali G., Gade Archana L., Bondre
Namrata R., 2015). In this paper the main target was to detect the kind of attack and classify
it. The experiments show that the proposed approach detects the attacks and
classifies them into 10 clusters with about 94% accuracy using two hidden
layers of neurons in a neural network. Multi-layer perceptron and the Apriori algorithm were
used in this research. The backpropagation method was used to improve detection and
classify all kinds of attacks.
(Devaraju and S. Ramakrishnan). Multivariate statistical methods were used for
anomaly detection. A Markov model is used for implementation and for system-call-based
anomaly detection. Batch sequencing and adaptive sequencing check-point detection are
used for attack detection in network traffic. The AdaBoost algorithm, which uses decision
rules, handles both categorical and continuous features. The algorithm focuses on four
modules: feature extraction, data labelling, design of the weak classifiers, and construction of
the strong classifier. The system works on the KDD CUP 1999 intrusion detection dataset.
Conditional and layered approaches address the two issues of accuracy and efficiency.
(Upendra, Assistant Professor, CSE Department, NIT Raipur, C.G., India). The paper
analyses two learning algorithms, NB and C4.5, for detecting intrusions and then compares
them. The paper showed that C4.5 performs better than NB: C4.5 has the highest
classification accuracy with the lowest error rate.
(Maninder Singh, Sanjeev Rao, 2015). In this survey paper a comparison of all the classifiers
is carried out. The results show that not all data mining methods are satisfactory
enough. From this survey we can say that random forest provides more accurate
results as compared to the other classifiers.
(Ajith Abraham, Crina Grosan and Carlos Martin-Vide). In this paper an intrusion detection
program was proposed for detecting attack patterns. The program works as a defensive
mechanism for safeguarding the system. Three variants of genetic programming are used,
namely linear genetic programming, multi-expression programming and gene
expression programming. Several performance indices are used for comparison and a
systematic study of the MEP method is provided. The empirical results show
that genetic programming can play a major role in developing intrusion detection programs.
These intrusion detection programs are lightweight and more accurate compared to standard
intrusion detection systems that use machine learning as their learning paradigm.
The dataset used in the paper was prepared by DARPA in 1998 at the MIT Lincoln Labs.
CHAPTER 4
4.1 METHODOLOGIES:
In this work we use two approaches for enhancing the performance of the nearest neighbour
classifier. The first approach is used for developing a genetic programming based
numeric expression classifier, ModNN. The ModNN classifier is developed by
slightly modifying the voting and selection methods of the KNN classifier.
The second approach is used for combining the classifiers through GP-based combination
techniques.
4.2 Nearest Neighbour:
The nearest neighbour machine learning algorithm relies on the locality of the instances
present in the input data. Newly encountered examples are classified based on
the data already stored in the database. New examples are classified based on the
closest examples, determined by the Euclidean distance; the decision is determined by the
closest k examples. To assign the correct class to a data example an optimal mapping
function f(x) is used. For classification problems that have only two classes the data is
classified into two classes, i.e. either C1 or C2.
The output of the nearest neighbour classifier is evaluated by the ROC curve. First the
output is determined and then it is scaled into the range used, e.g. (0,1).
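A minimal Python sketch of this basic nearest-neighbour rule, assuming binary labels
(1 = attack, 0 = normal) and Euclidean distance, is given below; the scaled score in (0,1) is the
value that is later thresholded for the ROC curve:

import numpy as np

def knn_predict(X_train, y_train, x, k=5):
    """Classify x by the majority class of its k nearest training examples,
    with closeness measured by Euclidean distance."""
    distances = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(distances)[:k]
    votes = y_train[nearest]
    return int(votes.sum() > k / 2)   # binary labels: 1 = attack, 0 = normal

def scaled_score(X_train, y_train, x, k=5):
    """Scale the raw output into the range (0, 1) as the fraction of attack
    neighbours, which can later be thresholded for the ROC curve."""
    distances = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(distances)[:k]
    return y_train[nearest].mean()

X_train = np.random.rand(100, 4)
y_train = (X_train[:, 0] > 0.5).astype(int)
print(knn_predict(X_train, y_train, X_train[0]), scaled_score(X_train, y_train, X_train[0]))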
4.3 Proposed work
4.3.1 Developing Numeric Expression Classifier:
For developing ModNN, optimization methods that are based on genetic programming
will be used. The optimized classifiers have better performance than simple classifiers.
Based on the distribution of the training examples, the mapping from feature space to class
space is given by these numeric expression classifiers. The various modules that will be
used for developing the numeric expression classifier are as below:
Fig 4.1 Block diagram for developing an optimized numeric expression classifier (the training
data drives the GP cycle of initial population generation, fitness evaluation with the AUCH
module and a termination check; the optimal NEC is then saved, a classification model is
constructed, and the NEC testing module evaluates it on the testing data to give the output
performance).
4.3.2 GP Module:
This module is used to attain an optimal solution. In this module the GP operators are
applied to create a new generation from the selected individuals. The process terminates
when the module reaches the desired, optimal result. A tree is built to
represent a candidate solution. The terminal nodes of the tree contain constant or
variable values, whereas the non-terminal nodes are represented by functions. The functions
are used to process the input values.
4.3.3 Fitness Computing Module:
From the GP population an individual is picked and tested over the given threshold range for
performance. For all test examples the prediction of the GP individual is made. For
different thresholds the TPR and FPR are computed to plot the ROC curve. Then the AUCH
of the individual is found. The output of this module is given as input to the GP module.
The issue in the GP process is selecting suitable attribute values for a GP tree to
compute its fitness. In our case we consider that a training example that lies far away
from the test example contributes less to the decision. So we consider the median distance for the
selection of neighbours. Within the median distance we have quartile strips; for our
experiments we use the first two quartiles, Q1 and Q2.
In terms of quartiles, the projection of a test sample x ∈ Stst in 2-D is shown in the figure:
Fig 4.2 Counting normal and attack connections using Euclidean distance.
Let Q1n and Q2n be the counts of normal connections in quartile 1 and quartile 2, and Q1a
and Q2a be the counts of attack connections in the corresponding quartiles. The count of
attacks and the count of normals in each strip can have different weights. Higher
weights are automatically given to the smaller strips through the GP process. Voting is
based on the count of each class in the strip.
For a test example, the prediction of an individual is carried out by giving the
values of Q1n and Q2n as inputs for computing the probable normal connection count (PN).
Similarly we supply the data for the probable attack connection count (PA). The attack
connection probability is computed by dividing the difference of PA and PN by
their sum. The attack connection probability is compared with the threshold: if the
probability is greater than or equal to the threshold, the test example is predicted as an
attack, otherwise it is predicted as normal.
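The following Python sketch illustrates this quartile-based prediction. The fixed strip weights
w1 and w2 are placeholders: in the proposed method the weighting of the strip counts is
evolved by GP rather than fixed by hand.

import numpy as np

def quartile_counts(X_train, y_train, x):
    """Count normal and attack connections falling in the first two quartile
    strips of the Euclidean distances from the test sample x."""
    d = np.linalg.norm(X_train - x, axis=1)
    q1, q2 = np.quantile(d, [0.25, 0.5])
    in_q1, in_q2 = d <= q1, (d > q1) & (d <= q2)
    Q1n, Q2n = np.sum(in_q1 & (y_train == 0)), np.sum(in_q2 & (y_train == 0))
    Q1a, Q2a = np.sum(in_q1 & (y_train == 1)), np.sum(in_q2 & (y_train == 1))
    return Q1n, Q2n, Q1a, Q2a

def attack_probability(Q1n, Q2n, Q1a, Q2a, w1=2.0, w2=1.0):
    """PN and PA from the strip counts (w1 > w2 gives more weight to the inner
    strip; in the proposed method these weightings are evolved by GP).
    Attack probability = (PA - PN) / (PA + PN)."""
    PN = w1 * Q1n + w2 * Q2n
    PA = w1 * Q1a + w2 * Q2a
    return 0.0 if PA + PN == 0 else (PA - PN) / (PA + PN)

def predict(X_train, y_train, x, threshold=0.0):
    prob = attack_probability(*quartile_counts(X_train, y_train, x))
    return "attack" if prob >= threshold else "normal"

X = np.random.rand(200, 4)
y = (X[:, 0] > 0.5).astype(int)
print(predict(X, y, X[0]))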
Fig 4.3 Evaluation module for fitness in the GP process (for each threshold in [0,1] the attack
probability (AC-NC)/(AC+NC) of every test sample is compared with the threshold, the TPR
and FPR are computed, and finally the AUCH is obtained).
4.4 Ensemble:
4.4.1 Combining KNN Classifiers:
For combining the KNN-based classifiers we use a two-layered architecture. In the first layer
of our architecture, m component classifiers are constructed. Separate GP simulations are
used for the homogeneous and heterogeneous composite classifiers. The combined output
prediction OCC is a function of [C1(T), C2(T), C3(T), ..., Cn(T)].
Fig 4.4 A block diagram of the development of the optimal composite classifier (the
component predictions C1(x), ..., Cm(x), together with T and other random variables, feed the
GP-based combining technique).
The four main parts in the figure are:
 Input dataset
 Construction and selection of NN component classifiers
 Computation of GP fitness function
 GP process to develop OCC
4.4.2 KDD CUP 1999 DATASET:
The KDD CUP dataset is divided into three equal, non-overlapping sets, namely training
data1, testing data1 and testing data2, using the holdout method.
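A simple Python sketch of this three-way holdout split is shown below (toy data; the real
KDD CUP records have 41 features):

import numpy as np

def three_way_split(X, y, seed=0):
    """Split the data into three equal, non-overlapping sets
    (training data1, testing data1, testing data2) using a simple holdout."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    third = len(X) // 3
    parts = idx[:third], idx[third:2 * third], idx[2 * third:3 * third]
    return [(X[p], y[p]) for p in parts]

X = np.random.rand(999, 41)            # KDD CUP records have 41 features
y = np.random.randint(0, 2, size=999)  # 1 = attack, 0 = normal (toy labels)
(train1, test1, test2) = three_way_split(X, y)
print(len(train1[0]), len(test1[0]), len(test2[0]))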
4.4.3 Construction and selection of KNN component classifiers:
The first step in selecting KNN classifiers is to create a set of several high-performing,
complementary nearest neighbour classifiers. The second consideration is that including
several duplicates of the same KNN component classifier should not raise performance as
much as including duplicates of different classifiers.
Several KNN component classifiers are created for various choices of k by employing
Random Selection and Best Random Selection.
4.4.4 GP Fitness module:
The data samples taken from testing data1 determine the fitness of each individual. The
decision values of the individuals are obtained. By varying the threshold T in the
range [0,1], the TPR and FPR values are computed. The ROC curve is obtained by plotting
these values and the AUCH of the ROC curve is computed; the individual with the higher
AUCH has the higher performance. When the fitness score exceeds 0.999 or the number of
generations reaches the predetermined maximum, the GP simulation is halted.
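An illustrative Python sketch of this fitness computation is given below: the decision values
of an individual are thresholded over [0,1], TPR/FPR points are collected, and the area under
the convex hull of the resulting ROC points is returned as the fitness. The hull construction
shown here is one straightforward way of computing the AUCH and is not taken from the
original work; the scores and labels are toy data.

import numpy as np

def roc_points(scores, labels, thresholds=np.linspace(0.0, 1.0, 101)):
    """TPR/FPR of an individual's decision values at a range of thresholds T."""
    pts = [(0.0, 0.0), (1.0, 1.0)]
    for t in thresholds:
        pred = (scores >= t).astype(int)
        tp = np.sum((pred == 1) & (labels == 1))
        fp = np.sum((pred == 1) & (labels == 0))
        tpr = tp / max(1, np.sum(labels == 1))
        fpr = fp / max(1, np.sum(labels == 0))
        pts.append((fpr, tpr))
    return sorted(set(pts))

def auch(points):
    """Area under the convex hull of the ROC points (the GP fitness score)."""
    hull = []
    for p in points:                      # upper convex hull, points sorted by FPR
        while len(hull) >= 2:
            (x1, y1), (x2, y2) = hull[-2], hull[-1]
            if (x2 - x1) * (p[1] - y1) - (y2 - y1) * (p[0] - x1) >= 0:
                hull.pop()
            else:
                break
        hull.append(p)
    return sum((x2 - x1) * (y1 + y2) / 2 for (x1, y1), (x2, y2) in zip(hull, hull[1:]))

scores = np.random.rand(500)              # hypothetical decision values in [0, 1]
labels = (scores + 0.2 * np.random.rand(500) > 0.6).astype(int)
print(auch(roc_points(scores, labels)))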
What is Machine Learning?
Learning, like intelligence, covers such a broad range of processes that it is difficult to define
precisely. A dictionary definition includes phrases such as "to gain knowledge, or
understanding of, or skill in, by study, instruction, or experience" and "modification of a
behavioural tendency by experience." Zoologists and psychologists study learning in animals
and humans. Here we focus on learning in machines. There are many parallels between
animal and machine learning. Certainly, many techniques in machine learning derive from
the efforts of psychologists to make their theories of animal and human learning more precise
through computational models. It also seems likely that the concepts and techniques being
explored by researchers in machine learning may illuminate certain aspects of biological
learning.
As regards mechanisms, we might say, very generally, that a machine learns
whenever it changes its structure, program, or data (based on its inputs or in response to
external information) in such a manner that its expected future performance improves.
Some of these changes, such as the addition of a record to a database, fall
comfortably within the scope of other disciplines and are not necessarily better understood
for being called learning. But, for example, when the performance of a speech-recognition
machine improves after hearing several samples of a person's speech, we feel
quite justified in that case in saying that the machine has learned.
Machine learning usually refers to changes in systems that perform tasks
associated with artificial intelligence (AI). Such tasks involve recognition, diagnosis,
planning, robot control, prediction, etc. The "changes" might be either enhancements to
already performing systems or ab initio synthesis of new systems. To be
somewhat more specific, consider the design of a typical AI agent: the
agent perceives and models its environment and computes appropriate actions, perhaps by
anticipating their effects. Changes made to any of its components
might count as learning. Different learning mechanisms might be employed depending
on which subsystem is being changed.
One might ask, "Why should machines have to learn? Why not design machines to
perform as desired in the first place?" There are several reasons why machine
learning is important. Of course, we have already remarked that the achievement of
learning in machines might help us understand how animals and humans learn.
But there are important engineering reasons as well. Some of these are:
 Some tasks cannot be defined well except by example, i.e. we might be able to
specify input/output pairs but not a concise relationship between inputs and desired
outputs. We would like machines to be able to adjust their internal structure to
produce correct outputs for a large number of sample inputs and thus
suitably constrain their input/output function to approximate the relationship implicit
in the examples.
 It is possible that hidden among large piles of data are important relationships and
correlations. Machine learning methods can often be used to extract
these relationships (data mining).
 Human designers often produce machines that do not work as well as desired
in the environments in which they are used. In fact, certain characteristics of the
working environment might not be completely known at design time. Machine
learning methods can be used for on-the-job improvement of existing machine
designs.
 The amount of knowledge available about certain tasks might be too large for
explicit encoding by humans. Machines that learn this knowledge gradually might be
able to capture more of it than humans would want to write down.
 Environments change over time. Machines that can adapt to a changing environment
would reduce the need for constant redesign.
4.4.5 Genetic Programming Module:
In order to produce the next population, three GP operators, namely reproduction, mutation
and crossover, are used in the GP process. These operators aid convergence to the optimal
solution. The optimal composite classifier is expected at the end of the GP process.
GP is a heuristic search technique designed to simulate processes in natural systems. GP
belongs to the larger class of evolutionary algorithms that produce solutions to optimization
problems using different methods inspired by natural evolution, such as inheritance,
mutation, selection and crossover. These are adaptive heuristic search algorithms
based on the evolutionary ideas of natural selection and genetics. The basic idea of
these evolutionary algorithms is to simulate the processes in natural systems essential for
evolution. GP is used for numerical and computational optimization and for
exploring the evolutionary aspects of models of social systems. The GP approach can be used
to optimize a set of indices derived from complex network theory. Genetic algorithms are
search algorithms based on the mechanics of natural selection and natural genetics.
They combine survival of the fittest among string structures with structured yet
randomized information exchange to form a search algorithm with some of the
innovative flair of human search.
GP performs a balanced search over various nodes, and there is a need to maintain population
diversity during the search so that important information is not lost, because there is a strong
need to focus on the fit portions of the population. Reproduction in GP is defined as the
process of producing offspring. GP has been used to supplement network-based
approaches, for example to optimize a set of indices derived from complex network theory.
The first requirement of a GP is a set of solutions represented by chromosomes,
called the population. The solutions taken from one population can be used to form a
new population, in the hope that the new population will be better than the
old one. The best solutions are selected to form new offspring. These solutions are
selected on the basis of their fitness, i.e. the most suitable individuals get more chances to
reproduce. GP is used for search, optimization and machine learning. GP is a
very popular method for optimization, is often successful in real applications and is of
interest to those working on meta-heuristics. Evolutionary algorithms are used to solve
problems that do not already have a well-defined efficient solution. Genetic programming
has been used to solve optimization problems.
4.4.5.1 Basic genetic Operators
 Selection
 Crossover
 Mutation
Population diversity plays a major role in the performance of GP. It is widely agreed
among GP developers that the higher the diversity in the population, the smaller the chance
of premature convergence and therefore the higher the chance of escaping from a local
optimum. Different crossover strategies have been proposed in the literature to create
diversity in the population. Goldberg proposed the Partially-Mapped Crossover operator
(PMX), in which a segment of one parent's chromosome is mapped onto a segment of the
other parent's chromosome and the remaining genes are exchanged. Another crossover
operator that has been proposed is the Cycle Crossover operator (CX). This operator creates
the offspring from the parents by copying the value of a gene along with its position from the
parents into the offspring, taking into account the feasibility of the chromosome. Frequency
Crossover (FC), along with nine different kinds of mutation, has been proposed to solve the
TSP. The FC is used to stabilize the population while the nine different kinds of mutation are
used to increase the diversity of the population to prevent the premature convergence
problem. GP suffers from the difficulty of local optimum convergence. This is the case when
an outstanding individual takes over a significant proportion of the finite population and
leads towards unwanted convergence. There are various methods to avoid premature
convergence, such as restricted mating, incest prevention, crowding, introducing a random
offspring in each generation, an adaptive mutation rate, controlling crossover greediness and
the impact of random factors, the Social Disaster technique, niching, and dynamic genetic
clustering algorithms (DGCA).
4.4.6 Classic Genetic Programming
Step 1. Creating the initial population
Initially, a number of individual solutions are randomly generated to make the initial
population. Very commonly, the population is generated randomly so as to cover the entire
range of attainable solutions. Alternatively, the solutions may also be "seeded" in
areas where optimal solutions are likely to be found, for instance when a "seed" is
an existing solution to be improved for an engineering design problem.
Step 2. Evaluation and ranking
In this step, the objective or fitness value corresponding to each individual solution is
computed; based on the individuals' fitness values, each individual is assigned a
rank, and the population is sorted according to these rankings.
Step 3. Selection operation
If the probability criterion is satisfied, an individual is selected and passed unchanged to the
next generation. There are many selection strategies; among the most popular, the individual
solutions are selected through a fitness-based method whereby the fitter solutions are
typically more likely to be selected.
Step 4. Crossover operation
31
The likelihood of crossover is set as a parameter at the onset of the program. If the likelihood
filter is gratified, 2 people are arbitrarily recognized as parents. One or 2 offspring (variation
of programming) are next made from this join of parents. In substitution the parents, the
offspring must to be a minimum of possible. This might involve trailing completely disparate
crossover parameters to comprehend practicableness.
Step 5. Mutation operation
The probability of mutation is also set as a parameter. If the probability filter is satisfied, an
individual is selected and subjected to mutation. The aim of mutation in GP is to allow the
algorithm to avoid local minima by preventing the population of candidate solutions from
being dominated by a few best candidates, which would slow down, or even halt, progress.
Step 6. Termination test
If a termination condition is reached, the generational process is terminated; otherwise steps
two to five are repeated. Common terminating conditions include: a satisfactory solution
being found, a fixed number of generations being reached, or the algorithm having converged
to an optimum so that successive iterations no longer produce better results. A compact
sketch of this generational loop is given below.
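The following is a minimal, illustrative sketch of the generational loop described in steps 1 to 6.
It evolves fixed-length real-valued individuals with tournament selection, one-point crossover
and Gaussian mutation; the function names, parameter values and the toy fitness function are
assumptions made for illustration, not the configuration used in this work.

```python
import random

def evolve(fitness, n_genes=5, pop_size=50, generations=100,
           p_cx=0.8, p_mut=0.1, seed=0):
    """Toy generational evolutionary loop (steps 1-6): create, evaluate,
    select, cross over, mutate, and test for termination."""
    rng = random.Random(seed)
    # Step 1: random initial population
    pop = [[rng.uniform(-1, 1) for _ in range(n_genes)] for _ in range(pop_size)]
    for _ in range(generations):                           # Step 6: generation-count termination
        scored = sorted(pop, key=fitness, reverse=True)    # Step 2: evaluate and rank
        new_pop = scored[:2]                               # keep two elite individuals
        while len(new_pop) < pop_size:
            # Step 3: tournament selection of two parents
            p1 = max(rng.sample(scored, 3), key=fitness)
            p2 = max(rng.sample(scored, 3), key=fitness)
            child = list(p1)
            if rng.random() < p_cx:                        # Step 4: one-point crossover
                cut = rng.randrange(1, n_genes)
                child = p1[:cut] + p2[cut:]
            if rng.random() < p_mut:                       # Step 5: Gaussian mutation of one gene
                i = rng.randrange(n_genes)
                child[i] += rng.gauss(0, 0.1)
            new_pop.append(child)
        pop = new_pop
    return max(pop, key=fitness)

# Toy fitness: maximise the negative squared distance from the origin.
best = evolve(lambda ind: -sum(g * g for g in ind))
print(best)
```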
Fig 4.5 Flowchart of genetic programming (parameter setting, population initialization,
evaluation and ranking, selection, crossover, mutation, creation of the new population,
termination test, output of the result)
4.4.7 GP: advantages and disadvantages
Advantages:
 The GP is, by nature, a parallel search: a number of candidate solutions are considered
simultaneously, so that a global optimum is more likely to be found.
 Compared to gradient-based methods, GPs impose fewer mathematical requirements (such
as differentiability of the objective and constraint functions, continuity of the variables,
etc.) on the optimization problem; they can therefore handle all kinds of objective functions
and constraints defined in discrete, continuous or mixed search spaces, and require only
straightforward computations in each iteration.
Disadvantages:
 GPs are less efficient than gradient-based programming when solving optimization
problems with purely continuous variables, as indicated by the fact that many more
iterations are required for convergence.
 Compared to gradient-based and directional methods, many more function evaluations are
required in every iteration of the GP, so it costs far more computer time per iteration.
One focus of this study is to develop a generic GP problem solver for engineering
optimization problems, which commonly involve continuous, integer and discrete variables or
a mixture of these. Problems of this sort are usually referred to as mixed-variable problems.
To be as general as possible, the GP solvers considered in this study are designed for
mixed-variable optimization. The common disadvantage of existing GP solvers is that they
lack the flexibility to handle mixed-variable problems. One approach to address this issue was
developed by Deb, who proposed a GP that handles mixed variables by using a mixed-variable
coding scheme together with a mixed-variable crossover and a mixed-variable mutation
operator. One problem with Deb's methodology is that these operators need to be
reprogrammed to suit different design problems; that is, the underlying coding of Deb's
approach is problem-dependent. Such re-programming is expensive and time-consuming,
thereby limiting its wider application. A small illustration of a mixed-variable encoding and
mutation is sketched below.
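To make the idea of a mixed-variable encoding concrete, the following is a small, illustrative
sketch (not Deb's actual operators): a chromosome holds continuous, integer and categorical
genes side by side, and the mutation operator perturbs each group in a type-appropriate way.
All names, ranges and rates below are assumptions chosen for illustration.

```python
import random
from dataclasses import dataclass

@dataclass
class MixedIndividual:
    reals: list      # continuous design variables
    ints: list       # integer design variables
    choice: int      # categorical variable (index into a list of options)

def mutate(ind, rng, p=0.2, options=3):
    """Type-aware mutation: Gaussian noise for reals, +/-1 steps for integers,
    resampling for the categorical gene."""
    reals = [x + rng.gauss(0, 0.1) if rng.random() < p else x for x in ind.reals]
    ints = [max(0, x + rng.choice([-1, 1])) if rng.random() < p else x for x in ind.ints]
    choice = rng.randrange(options) if rng.random() < p else ind.choice
    return MixedIndividual(reals, ints, choice)

rng = random.Random(1)
parent = MixedIndividual(reals=[0.5, -1.2], ints=[3, 7], choice=1)
print(mutate(parent, rng))
```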
4.4.8 K-Nearest Neighbours classifier
The fundamental issue in data mining when addressing the classification problem is the
learning of classifiers. To address it, a dataset is designed that contains a set of training
instances and their corresponding labels; the classifier is trained on this dataset and is then
used to predict the class of an unseen instance. An instance of a class is described by a
vector: an instance x can be defined by a vector <A1(x), A2(x), ..., An(x)>, where Ai(x)
denotes the value of the ith attribute. The symbols C and c are used to denote the class
variable and its values, and the class of instance x is denoted by c(x).
K-Nearest-Neighbours classifiers have been widely used in classification problems. The
classifier depends wholly and solely on distance: a distance function is used to determine the
difference or similarity between two instances, and it is usually the standard Euclidean
distance. The distance between two instances is defined as:
d(x, y) = √( Σ_{i=1}^{n} (a_i(x) − a_i(y))² )
For any instance x, the classifier measures the distances and assigns x to the class that is most
common among its k nearest neighbours, as shown in the equation below. The KNN classifier
is a typical example of a lazy learning algorithm, which simply stores the training data at
training time and does its work at classification time. In contrast to lazy learning, eager
learning generates an explicit model at training time.
c(x) = argmax_{c ∈ C} Σ_{i=1}^{k} δ(c, c(y_i))
In the above equation y_1, y_2, ..., y_k are the k nearest neighbours of the instance x, k is the
number of neighbours, and δ(c, c(y_i)) = 1 if c = c(y_i) and δ(c, c(y_i)) = 0 otherwise.
The KNN classifier has no doubt been widely used for decades because of its simplicity,
effectiveness and robustness. Nevertheless it has several shortcomings, of which three main
ones are:
1) the standard Euclidean distance is used as the distance function for measuring difference
or similarity;
2) the neighbourhood-size parameter k is assigned artificially;
3) a simple voting scheme is used for class probability estimation.
To overcome these shortcomings, three main approaches are used:
1) use more accurate distance functions as a replacement for the standard Euclidean
distance;
2) replace the artificially chosen input parameter k by searching for the best neighbourhood
size;
3) find more accurate class probability estimation methods to replace simple voting.
A minimal sketch of the basic KNN procedure described above is given below.
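For reference, this is a minimal sketch of the plain KNN rule described above (Euclidean
distance plus simple majority voting). The function and variable names and the tiny example
data are illustrative; in practice a library implementation such as scikit-learn's
KNeighborsClassifier would normally be used.

```python
import math
from collections import Counter

def euclidean(x, y):
    """Standard Euclidean distance between two attribute vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def knn_predict(train_X, train_y, x, k=3):
    """Classify x as the majority class among its k nearest training instances."""
    neighbours = sorted(zip(train_X, train_y), key=lambda pair: euclidean(pair[0], x))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Tiny example: two classes separated mainly along the first attribute.
X = [[0.1, 1.0], [0.2, 0.8], [0.9, 0.1], [1.0, 0.2]]
y = ["normal", "normal", "attack", "attack"]
print(knn_predict(X, y, [0.85, 0.15], k=3))   # expected: "attack"
```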
4.4.9 Principal component analysis
Principal component analysis (PCA) is one of the most valuable results of applied linear
algebra. It is used abundantly in all forms of analysis, from neuroscience to computer
graphics, because it is a simple, non-parametric method for removing redundant information
from confusing data sets. PCA provides a roadmap for reducing a complex data set to a lower
dimension in order to reveal the sometimes hidden, simplified dynamics that often underlie it,
with minimal additional effort.
The main aim of this section is to develop both an intuitive feeling for PCA and a more
systematic explanation of the topic. We start with a simple example and give an intuitive
explanation of the goal of PCA. The natural way to proceed is then to place it within the
framework of linear algebra, adding mathematical rigour so that the problem can be stated
explicitly, and to discuss how and why PCA is intimately related to the mathematical
technique of singular value decomposition (SVD). This result leads to a prescription for how
to make the best use of PCA in practice. We also discuss the assumptions behind the method
as well as possible extensions to overcome its limitations. A small computational sketch of
PCA via SVD is given below.
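The following is a minimal sketch of PCA computed through SVD with NumPy, as an
illustration of the relationship discussed above; the array shapes, the toy data and the choice of
keeping two components are assumptions made for this example.

```python
import numpy as np

def pca(X, n_components=2):
    """Project the rows of X onto the top principal components.

    Steps: centre the data, take the SVD of the centred matrix, and use the
    right singular vectors as the principal directions.
    """
    X_centred = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by singular value.
    U, S, Vt = np.linalg.svd(X_centred, full_matrices=False)
    components = Vt[:n_components]
    explained_variance = (S ** 2) / (len(X) - 1)
    return X_centred @ components.T, components, explained_variance[:n_components]

# Toy data: 100 noisy 3-D points that mostly vary along one direction.
rng = np.random.default_rng(0)
t = rng.normal(size=(100, 1))
X = np.hstack([t, 0.5 * t + 0.05 * rng.normal(size=(100, 1)),
               0.05 * rng.normal(size=(100, 1))])
scores, directions, var = pca(X, n_components=2)
print(directions.shape, var)   # (2, 3) and the two largest variances
```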
4.4.9.1 An Example of PCA
Assume that we are an experimenter trying to understand some phenomenon by measuring
various quantities (e.g. spectra, voltages, velocities, etc.) in our system. Unfortunately, we
may not be able to figure out what is happening, because the data set appears unclear,
clouded and even redundant. This is not a trivial problem, but a fundamental obstacle in
empirical work. Examples abound in complex systems such as neuroscience, photometry,
meteorology and oceanography: the number of variables to measure can be unwieldy, even
though the underlying dynamics are often quite simple.
To make this clear, consider the simple toy example from physics diagrammed in Figure 4.6.
Suppose we are studying the motion of the physicist's ideal spring. The system consists of a
ball of mass m attached to a massless, frictionless spring. The ball is released a small distance
away from equilibrium (i.e. the spring is stretched). Because the spring is "ideal", the ball
oscillates indefinitely about its equilibrium along the x-axis at a fixed frequency.
The standard problem in physics is that the motion along the x direction is solved as an
explicit function of time; in other words, the overall dynamics can be expressed in terms of a
single variable x.
Figure 4.6: Diagram of the toy example.
However, as non-expert experimenters we may not know any of this. We do not know which,
let alone how many, axes and dimensions are important to measure. Thus, in order to take
measurements, we track the ball's position in three-dimensional space: we simply place three
video cameras around our system of interest. Each video camera records an image at 200 Hz
representing a two-dimensional position of the ball. Unfortunately, because of our ignorance,
we do not even know what the real "x", "y" and "z" axes are, so we choose three camera axes
{a, b, c} at some arbitrary angles with respect to the system; the angles between our
measurements might not even be 90°! We then record with the cameras for 2 minutes. The
big question remains: how do we get from this data set to a simple equation of x?
The Goal: Principal component analysis computes the most meaningful basis with which to
re-express a noisy, garbled data set, in the hope that this new basis will filter out the noise and
reveal the hidden dynamics. In our example, the goal of PCA is to determine that "the
dynamics are along the x-axis"; in other words, to determine that the unit basis vector along
the x-axis is the important dimension. Determining this fact allows an experimenter to discern
which dynamics are important and which are redundant.
4.4.10 KDD CUP 99 DATA SET DESCRIPTIONS
KDD'99 is the most widely used data set for evaluating intrusion detection methods. The data
set is a corrected version built on the data collected in the DARPA'98 IDS evaluation
programme. DARPA'98 contains 4 gigabytes of compressed binary data from 7 weeks of
network traffic, which was processed into about 5 million connection records of roughly 100
bytes each; around 2 million further connection records were derived from the two weeks of
test data. The KDD training dataset consists of about 4,900,000 single connection vectors,
each of which contains 41 features and is labelled as either normal or an attack, with exactly
one specific attack type. The attacks fall into the following four groups:
1) DoS (Denial of Service): the attacker makes some computing or memory resource too
busy or too full to handle legitimate requests, or denies legitimate users access to a
machine.
2) U2R (User to Root): the attacker exploits a vulnerability in order to become the
administrator of the victim computer, typically starting from normal user access gained
by password sniffing, a dictionary attack, or social engineering.
3) R2L (Remote to Local): occurs when an attacker who can send packets to a machine
over a network, but who does not have an account on that machine, exploits some
vulnerability to gain local access as a user of that machine.
4) Probing: the attacker gathers information about a network of machines, and this
information is later used to compromise its security controls.
The KDD'99 CUP dataset features can be divided into three main categories:
1) Basic features: this group contains the attributes that can be extracted from a TCP/IP
connection; many of these features lead to a delay in detection.
2) Traffic features: this group includes features that are computed with respect to a
window interval, and is divided into two categories:
a) "same host" features: examine only the connections in the past 2 seconds that have
the same destination host as the current connection, and calculate statistics related
to protocol behaviour, service, etc.
b) "same service" features: examine only the connections in the past 2 seconds that
have the same service as the current connection.
The two kinds of "traffic" features above are time-based. However, several slow probing
attacks scan the ports using a much larger time interval than 2 seconds, for example one
probe per minute; as a result, these attacks do not produce intrusion patterns within a time
window of 2 seconds. To solve this problem, the "same host" and "same service" features are
re-calculated, based on a connection window of 100 connections instead of a time window of
2 seconds. These are termed connection-based traffic features.
3) Content features: unlike most DoS and probing attacks, R2L and U2R attacks do not
exhibit frequent sequential patterns. DoS and probing attacks involve many
connections to some host(s) in a very short span of time, whereas R2L and U2R
attacks are embedded in the data portions of the TCP/IP packets and normally
involve only a single connection. To detect these kinds of attack, features are needed
that can look for suspicious behaviour in the data portion, e.g. the number of failed
login attempts. A small sketch of how the raw KDD labels can be grouped into the
four attack categories above is given below.
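As a practical illustration of working with the dataset, the sketch below groups a few of the
raw KDD'99 attack labels into the four categories described above. The file name, the column
handling and the (deliberately partial) label-to-category mapping are assumptions for
illustration; the full dataset has 41 features plus a label column.

```python
import csv

# Partial, illustrative mapping from raw KDD'99 labels to the four attack groups.
CATEGORY = {
    "normal": "normal",
    "neptune": "dos", "smurf": "dos", "back": "dos", "teardrop": "dos",
    "buffer_overflow": "u2r", "rootkit": "u2r", "perl": "u2r",
    "guess_passwd": "r2l", "ftp_write": "r2l", "warezclient": "r2l",
    "portsweep": "probe", "ipsweep": "probe", "nmap": "probe", "satan": "probe",
}

def category_counts(path):
    """Count records per attack category in a KDD-style CSV (label in last column)."""
    counts = {}
    with open(path, newline="") as f:
        for row in csv.reader(f):
            label = row[-1].rstrip(".")          # raw labels may end with a trailing dot
            cat = CATEGORY.get(label, "other")
            counts[cat] = counts.get(cat, 0) + 1
    return counts

# Hypothetical file name for the 10% training subset.
print(category_counts("kddcup.data_10_percent.csv"))
```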
4.4.11 Experimental Setup:
A. Input Database: KDD CUP 99
B. PCA based feature extraction
C. KNN Classifiers.
D. Matlab.
E. Genetic Programming toolkit
4.4.12 Terms:
ROC: It stands for receiver operating characteristic and is a graphical plot that shows the
performance of a binary classifier as its discrimination threshold varies. It is obtained by
plotting the true positive rate against the false positive rate as the decision threshold varies.
True positive rate (sensitivity): TPR = TP / P = TP / (TP + FN)
True negative rate (specificity): TNR = TN / N = TN / (FP + TN)
Positive predictive value (precision): PPV = TP / (TP + FP)
Negative predictive value: NPV = TN / (TN + FN)
False positive rate: FPR = FP / N = FP / (FP + TN) = 1 − specificity
False negative rate: FNR = FN / P = FN / (FN + TP)
Accuracy: ACC = (TP + TN) / (P + N)
A small sketch computing these measures from the raw counts is given below.
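The following is a minimal sketch that computes the measures listed above from the four
confusion-matrix counts; the function name and the example counts are illustrative.

```python
def binary_metrics(tp, fp, tn, fn):
    """Compute the standard binary-classification measures from raw counts."""
    p, n = tp + fn, fp + tn                  # actual positives and negatives
    return {
        "TPR": tp / p,                       # sensitivity / recall
        "TNR": tn / n,                       # specificity
        "PPV": tp / (tp + fp),               # precision
        "NPV": tn / (tn + fn),
        "FPR": fp / n,                       # = 1 - specificity
        "FNR": fn / p,
        "ACC": (tp + tn) / (p + n),
    }

# Example counts (illustrative only).
print(binary_metrics(tp=95, fp=3, tn=97, fn=5))
```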
CHAPTER 5
Results and Discussion
5.1 GP based Optimization
1. The algorithm begins by creating a random initial population.
2. The algorithm then creates a sequence of new populations. At each step, the algorithm
uses the individuals in the current generation to create the next population. To create
the new population, the algorithm performs the following steps:
a. Scores each member of the current population by computing its fitness value.
b. Scales the raw fitness scores to convert them into a more usable range of
values.
c. Selects members, called parents, based on their fitness.
d. Some of the individuals in the current population that have the best fitness are
chosen as elite. These elite individuals are passed directly to the next population.
e. Produces children from the parents. Children are produced either by making
random changes to a single parent (mutation) or by combining the vector
entries of a pair of parents (crossover).
f. Replaces the current population with the children to form the next generation.
A sketch of the fitness evaluation used in step (a) for tuning the KNN classifier is given below.
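As an illustration of how each candidate can be scored, the sketch below evaluates a candidate
(k, folds) pair by the cross-validated accuracy of a KNN classifier, matching the K and folds
columns of Table 5.1. It uses scikit-learn; the library functions named are real, but the wrapper
function, the parameter values and the synthetic stand-in data are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def knn_fitness(candidate, X, y):
    """Fitness of a candidate = mean cross-validated accuracy of a KNN classifier.

    candidate is a (k, folds) pair, mirroring the K and folds columns of the
    optimization table.
    """
    k, folds = int(candidate[0]), int(candidate[1])
    clf = KNeighborsClassifier(n_neighbors=k)
    return cross_val_score(clf, X, y, cv=folds, scoring="accuracy").mean()

# Synthetic stand-in for a preprocessed KDD subset (illustrative only).
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
print(knn_fitness((15, 5), X, y))
```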
Fig 5.1 : Current best individual of classifier optimization using Genetic Programming
Fig 5.2 : Number of children for selection per individual
 Generations: the algorithm stops when the number of generations reaches the value of
Generations.
 Time limit: the algorithm stops after running for an amount of time (in seconds) equal to
Time limit.
 Fitness limit: the algorithm stops when the value of the fitness function for the best point
in the current population is less than or equal to Fitness limit.
 Stall generations: the algorithm stops when the average relative change in the fitness
function value over Stall generations is less than Function tolerance.
 Stall time limit: the algorithm stops if there is no improvement in the objective function
during an interval of time (in seconds) equal to Stall time limit.
 Stall test: the stall condition is either "average change" or "geometric weighted". For
geometric weighted, the weighting function is 1/2^n, where n is the number of
generations prior to the current one. Both stall conditions apply to the relative change in
the fitness function over Stall generations.
A small sketch of the Stall generations test is given below.
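As an illustration of the Stall generations criterion described above, the sketch below checks
whether the average relative change of the best fitness over the last few generations has fallen
below a tolerance; the function name, the window length and the tolerance value are
assumptions made for this example.

```python
def stalled(best_fitness_history, stall_generations=50, tol=1e-6):
    """Return True when the average relative change of the best fitness over the
    last `stall_generations` generations is below `tol`."""
    if len(best_fitness_history) <= stall_generations:
        return False
    window = best_fitness_history[-(stall_generations + 1):]
    changes = [abs(window[i + 1] - window[i]) / (abs(window[i]) + 1e-12)
               for i in range(len(window) - 1)]
    return sum(changes) / len(changes) < tol

# Example: a best-fitness trace that has flattened out.
history = [0.5 + 0.4 * (1 - 0.9 ** g) for g in range(200)]
print(stalled(history))   # True once the curve has converged
```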
Fig 5.3 : Stopping criteria
Fig 5.4 : As generations grow, the average distance between individuals drops significantly
Table 5.1: Optimization table used for improving the accuracy of the NN classifier using GP
Records   Folds   K   Model   Accuracy (%)   Time (mins)
1000 10 50 3 91.63194597 20.02204587
2000 9 33 2 95.56515682 24.58807926
3000 9 44 1 92.84629231 20.33630915
4000 6 32 2 97.8597933 29.40310606
5000 1 49 3 98.42251766 18.75813973
6000 8 49 1 91.22030836 17.1748795
7000 6 50 2 96.65303874 15.18687511
8000 10 45 1 98.18732645 38.65139885
9000 3 14 3 96.26225582 26.98995545
10000 6 21 3 91.39581563 40.5710283
11000 2 46 1 96.55104887 30.96600191
12000 7 41 1 98.45506164 29.24735065
13000 4 34 3 91.11947658 39.82860323
14000 9 14 1 98.23991181 27.07765661
15000 5 39 1 99.08721672 29.2388353
16000 4 11 3 96.94348921 37.3515081
17000 2 35 1 98.03364981 18.86148666
18000 3 31 1 95.04835363 20.01646052
19000 9 37 1 91.91128127 28.84630711
20000 3 16 1 97.78661687 32.06463708
21000 10 48 2 91.4506003 27.44625745
22000 5 33 2 93.55287676 21.8182112
23000 8 23 2 95.88219766 15.99263929
24000 4 28 2 94.10930793 26.05294057
25000 7 35 1 92.14031026 39.84650131
26000 1 11 3 98.14608682 34.30178114
27000 9 39 3 92.52787595 22.83691867
28000 6 38 3 91.54489232 31.22483872
29000 9 31 3 96.85810518 30.88720437
30000 6 14 1 92.60184764 39.34734564
31000 9 17 3 95.78801676 32.97699579
32000 5 36 3 92.57898704 22.20743218
33000 9 48 1 98.08208141 27.71891095
34000 5 24 2 98.3403275 24.44279958
35000 1 28 3 91.46325454 16.68770649
36000 10 12 2 92.43179309 33.64797178
37000 1 31 3 98.88151236 27.27911689
38000 1 50 3 96.13163571 17.72974834
39000 5 22 1 97.02544502 34.0623758
40000 3 12 2 96.15792541 30.01084341
41000 8 33 1 92.6072532 34.91882885
42000 6 30 2 95.37924219 38.42211453
43000 4 15 1 92.30750405 32.68717421
44000 7 28 3 94.49452536 40.36206059
45000 7 27 3 95.45092271 16.34874981
46000 5 27 1 96.64010756 40.30618937
47000 2 32 2 92.37634594 31.34647659
48000 2 36 1 91.23987267 21.16304196
49000 3 43 3 97.40824637 37.21525023
50000 1 12 1 97.17706732 18.53949767
51000 9 42 1 96.01816413 30.1555764
52000 2 38 3 92.48792584 17.66309901
53000 10 44 2 94.46830957 22.97369953
54000 7 30 2 98.10097504 40.13554461
55000 4 31 1 97.95035846 21.20222369
56000 2 40 3 97.13868828 35.6901613
57000 2 48 3 93.55654996 16.55046968
58000 4 20 3 92.38226205 31.27052777
59000 5 37 2 93.48526928 32.75813635
60000 1 33 3 98.84530446 32.4376096
61000 1 17 1 96.28232308 32.10352485
62000 3 34 2 96.97965357 33.72651938
63000 4 48 1 98.89024412 29.92691276
64000 3 25 3 92.82758822 21.51346239
65000 6 15 1 93.60358509 17.88959377
66000 5 11 1 94.80369651 36.65892868
67000 10 17 2 96.40250124 36.71025453
68000 1 27 3 91.35014933 24.14168778
69000 6 10 3 93.83818346 16.49742118
70000 8 31 2 95.1970614 20.41371837
71000 3 10 2 96.76017653 37.57928148
72000 6 26 2 96.36861206 17.56327725
73000 4 27 1 98.24578044 38.16389526
74000 7 24 1 93.77139385 18.22701564
75000 1 30 1 91.69639297 23.04492839
76000 4 29 2 96.51773395 24.76764888
77000 9 44 1 98.5190595 39.95662766
78000 7 46 1 94.7629563 17.26677531
79000 1 29 3 94.85044372 38.49637397
80000 2 29 1 98.62917285 40.85751541
81000 4 46 2 93.24627013 35.68743522
82000 5 11 1 99.71532399 28.52575177
83000 8 30 2 95.62676147 23.42587948
84000 3 14 1 92.66271567 24.37432495
85000 6 18 1 95.29526776 22.80211135
86000 8 14 2 92.46144523 20.48319044
87000 2 27 2 94.56656978 35.45920674
88000 2 42 2 97.89826853 39.36945694
89000 1 29 3 92.26531075 33.77601159
90000 4 35 2 94.83214801 38.84197587
91000 8 32 1 91.21519486 22.31646413
92000 3 40 1 91.71358899 39.89790079
93000 5 33 2 98.46424145 34.74886128
94000 7 34 3 94.08572974 23.59100755
95000 5 37 1 98.35090909 35.04958557
96000 1 35 1 97.92837342 26.35092419
97000 4 39 2 94.66189627 26.24445744
98000 8 20 3 98.20442344 31.96238143
99000 8 40 2 93.89936277 15.30406946
100000 4 15 2 97.61285491 31.05023913
5.2 Ensemble approach
In this approach, we first build the hybrid classifiers individually so that each attains good
generalization performance (optimizing the model for performance on unseen data rather than
on the training data). Test data is passed through each individual model and the corresponding
outputs are used to decide the final output. Empirical results show that the proposed ensemble
approach gives better performance for detecting probes and U2R attacks than any of the three
individual models. The ensemble approach classifies most records accurately by picking up
all the classes that are correctly classified by the three classifiers. As anticipated, the ensemble
approach exploits the differences in misclassification between the base models and improves
the overall performance. As is evident, none of the classifiers considered so far can perform
well for detecting all the attacks; to take advantage of the performance of the different
classifiers, a hierarchical hybrid intelligent system is proposed as described. A minimal sketch
of a simple output-combination scheme is given below.
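To make the idea of combining component outputs concrete, the sketch below combines the
predictions of several KNN models by simple majority voting. This is a simplified stand-in for
the GP-evolved combination function used in this work, and the model settings and synthetic
data are illustrative assumptions.

```python
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for a preprocessed intrusion dataset (illustrative only).
X, y = make_classification(n_samples=2000, n_features=20, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

# Component classifiers: KNN models with different neighbourhood sizes.
models = [KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr) for k in (5, 15, 35)]

def majority_vote(models, X):
    """Combine component outputs by taking the most common predicted label."""
    all_preds = [m.predict(X) for m in models]
    return [Counter(col).most_common(1)[0][0] for col in zip(*all_preds)]

combined = majority_vote(models, X_te)
accuracy = sum(p == t for p, t in zip(combined, y_te)) / len(y_te)
print(f"ensemble accuracy: {accuracy:.3f}")
```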
Fig 5.5: Scatter Plot of Host rat with Count For Class using KNN
Fig 5.6 : Scatter Plot of bytes transferred with bytes received for Class using KNN
Fig 5.7 : Classification Results using Ensemble of Classifiers
In statistics, a receiver operating characteristic (ROC) curve is a graphical plot that illustrates
the performance of a binary classifier system as its discrimination threshold is varied. The
curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR)
at various threshold settings. The true positive rate is also known as sensitivity, as the
sensitivity index d' ("d-prime") in signal detection and biomedical informatics, or as recall in
machine learning. The figure below shows the ROC curve for the GP-based NN classifier,
and a small sketch of how such a curve and its area can be computed follows the figure.
Fig 5.8 : ROC curve for GP based Classifier showing 0.99976 area under the curve
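The ROC curve and its area can be computed directly from the classifier's scores; the sketch
below uses scikit-learn's roc_curve and auc functions (these are real library functions, while
the label and score vectors shown are illustrative).

```python
from sklearn.metrics import roc_curve, auc

# Illustrative ground-truth labels (1 = attack) and classifier scores.
y_true  = [0, 0, 0, 0, 1, 1, 1, 1, 0, 1]
y_score = [0.1, 0.3, 0.2, 0.4, 0.8, 0.7, 0.9, 0.6, 0.5, 0.95]

fpr, tpr, thresholds = roc_curve(y_true, y_score)   # points of the ROC curve
print("AUC =", auc(fpr, tpr))                        # area under the curve
```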
The algorithm creates a KNN model using the training data set, in order to maximise the
chance of detection on the training data and to reduce the error for each class. The model was
developed on the basis of the KDD CUP data set: a first decision tree was built from the KDD
CUP training data subset and a second from the KDD CUP testing data subset. The decision
tree developed using the KDD CUP training data subset was tested over the KDD CUP
testing data subset, and vice versa. After creation of the K-NN models for U2R and R2L type
attacks, optimized rules were extracted using the Nearest Neighbour rules utility.
                     Training             Testing
Data Subset          KDD 10% corrected    KDD 10% corrected
Normal Packets       99.99%               99.98%
Table 5.2: Comparison of GP Ensemble NN Performance for the Normal Category
Fig 5.9 Confusion matrix for normal records
Fig 5.10 Comparison of GP Ensemble NN Performance for Normal Category
                                 Training             Testing
Data Subset                      KDD 10% corrected    KDD 10% corrected
Not-U2R Record Detection Rate    99.95%               99.91%
U2R Record Detection Rate        96.92%               96.14%
Table 5.3: Comparison of GP Ensemble NN Performance for the U2R Attack Category
Fig 5.11 Confusion matrix for U2R type attacks
Fig 5.12 Comparison of GP Ensemble NN Performance for the U2R Attack Category
                                 Training             Testing
Data Subset                      KDD 10% corrected    KDD 10% corrected
Not-R2L Record Detection Rate    98.12%               97.89%
R2L Record Detection Rate        98.34%               99.87%
Table 5.4: Comparison of GP Ensemble NN Performance for the R2L Attack Category
Fig 5.13 Confusion matrix for R2L
Fig 5.14 Comparison of GP Ensemble NN Performance for the R2L Attack Category
                                     Training             Testing
Data Subset                          KDD 10% corrected    KDD 10% corrected
Not-probing Record Detection Rate    98.13%               98.44%
Probing Record Detection Rate        94.56%               93.59%
Table 5.5: Comparison of GP Ensemble NN Performance for the Probe Attack Category
Fig 5.15 Confusion matrix for probe type attack
Fig 5.16 Comparison of GP Ensemble NN Performance for the probe Attack Category
CHAPTER 6
CONCLUSION AND REFERENCES
6.1 CONCLUSION:
To address the various issues identified above, we proposed a system, i.e. an ensemble of
classifiers built using genetic programming, which gives better performance than the
individual models. In this work we ensemble only NN classifiers; further work can be carried
out on assembling heterogeneous classifiers to obtain even better results. This work addresses
many of the issues that make it difficult to design effective classifiers: it discusses GP in
detail and shows how a classifier with better performance can be developed. In short, in an
ensemble of classifiers built using genetic programming, much of the requirement for human
expertise is removed and an automatic system is obtained. As depicted in the results, when
the models were trained on four folds, the amount of information required to achieve a
desirable detection rate was very high: only an 80% detection rate was achieved for R2L and
U2R attacks by the algorithms reported in the literature. If the number of records is increased
to 99.98% of the training data subset in the testing data subset, the detection rate for R2L and
U2R type attacks increases to 99%.
Over the last decades the pattern recognition approach to intrusion detection has attracted a
lot of interest, and the demand for reliable and sophisticated intrusion detection systems
capable of detecting polymorphous attacks has increased. In this work, we have presented a
novel intrusion detection approach that uses a genetic programming based ensemble for
detecting intrusions. The experimental results demonstrate that the GP-based ensemble
classifier is effective at reducing false alarms, so that widespread IDS systems can be
implemented using our approach with both accuracy and interpretability in mind. In future,
feature selection can be used not only to alleviate the curse of dimensionality and minimise
classification errors, but also to improve the interpretability of ensemble-based classifiers.
Our future work will focus on reducing the number of features for the classifiers by means of
feature selection. The work will also be continued to study the fitness function of the genetic
algorithm so as to manipulate more parameters of the fuzzy inference module, even
concentrating on the fuzzy rules themselves.
6.2 Future Scope:
Various papers have proposed different methodologies for ensembles of classifiers, and the
ensembles so produced were usually tied to a specific application field. No doubt these
systems have many advantages, but they have limitations too: diversity, computation time,
detection rate and false alarms are the main issues, and if one system addresses one of them,
the others are often neglected, which is the bigger liability of these systems. In general, these
systems also did not work well on unbalanced data. To address all of these issues we
proposed a system, i.e. an ensemble using genetic programming, that offers better
performance than the alternatives. In this work we ensemble only NN classifiers; upcoming
work will be carried out on assembling heterogeneous classifiers to obtain better results and
will address many of the issues that make designing competent classifiers difficult. With an
ensemble of classifiers built using genetic programming, much of the requirement for human
expertise is removed and an automatic system is obtained.
6.3 REFERENCES:
1. Gianluigi Folino, Giandomenico Spezzano and Clara Pizzuti, Ensemble Techniques for
Parallel Genetic Programming based Classifiers.
2. Durga Prasad Muni, Nikhil R. Pal and Jyotirmoy Das, A Novel Approach to Design
Classifiers Using Genetic Programming, IEEE Transactions on Evolutionary
Computation, Vol. 8, No. 2, April 2004.
3. Michał Woźniak, Manuel Graña, Emilio Corchado, A survey of multiple classifier
systems as hybrid systems, Elsevier, 2014.
4. Mark Crosbie, Eugene H. Spafford, Applying Genetic Programming to Intrusion
Detection, AAAI Technical Report FS-95-01, AAAI (www.aaai.org), 1995.
5. K. M. Faraoun, Genetic Programming Approach for Multi-Category Pattern
Classification Applied to Network Intrusions Detection, International Journal of
Computational Intelligence and Applications, Vol. 6, No. 1 (2006) 77–99, Imperial
College Press.
6. Gianluigi Folino, Clara Pizzuti and Giandomenico Spezzano, GP Ensemble for
Distributed Intrusion Detection Systems, ICAR-CNR, Via P. Bucci 41/C, Univ. della
Calabria, 87036 Rende (CS), Italy.
7. Niusvel Acosta-Mendoza, Alicia Morales-Reyes, Hugo Jair Escalante and Andrés
Gago-Alonso, Learning to Assemble Classifiers via Genetic Programming, International
Journal of Pattern Recognition and Artificial Intelligence, Vol. 28, No. 7 (2014) 1460005
(19 pages), World Scientific Publishing Company.
8. Xihua Li, Fuqiang Wang and Xiaohong Chen, Support Vector Machine Ensemble Based
on Choquet Integral for Financial Distress Prediction, International Journal of Pattern
Recognition and Artificial Intelligence, Vol. 29, No. 4 (2015) 1550016 (24 pages), World
Scientific Publishing Company.
9. Preeti Aggarwal, Sudhir Kumar Sharma, Analysis of KDD Dataset Attributes - Class
wise for Intrusion Detection, 3rd International Conference on Recent Trends in
Computing 2015 (ICRTC-2015).
10. Anup Goyal, Chetan Kumar, GA-NIDS: A Genetic Algorithm based Network Intrusion
Detection System.
11. Urvesh Bhowan, Mark Johnston, Mengjie Zhang and Xin Yao, Evolving Diverse
Ensembles Using Genetic Programming for Classification With Unbalanced Data, IEEE
Transactions on Evolutionary Computation, Vol. 17, No. 3, June 2013.
12. Saurabh Mukherjee, Neelam Sharma, Intrusion Detection using Naive Bayes Classifier
with Feature Reduction, 2012.
13. H. Nguyen, K. Franke, S. Petrovic, Improving Effectiveness of Intrusion Detection by
Correlation Feature Selection, 2010 International Conference on Availability, Reliability
and Security, IEEE.
14. Nivedita Naidu, R. V. Dharaskar, "An effective approach to network intrusion detection
system using genetic algorithm", International Journal of Computer Applications
(0975-8887), Volume 1, No. 2, 2010.
15. N. Chawla and J. Sylvester, "Exploiting diversity in ensembles: Improving the
performance on unbalanced datasets," in Proc. 7th Int. Conf. MCS, 2007, pp. 397–406.
16. H. Abbass, "Pareto-optimal approaches to neuro-ensemble learning," in Multi-Objective
Machine Learning (Studies in Computational Intelligence, vol. 16), Y. Jin, Ed.
Berlin/Heidelberg, Germany: Springer, 2006, pp. 407–427.
17. U. Bhowan, M. Zhang, and M. Johnston, "Genetic programming for classification with
unbalanced data," in Proc. 13th Eur. Conf. Genetic Programming, LNCS 6021, 2010,
pp. 1–13.
18. M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten, "The
WEKA data mining software: An update," SIGKDD Explorations, vol. 11, no. 1,
pp. 10–18, Nov. 2009.
19. Z. Wang, K. Tang, and X. Yao, "Multiobjective approaches to optimal testing resource
allocation in modular software systems," IEEE Trans. Reliab., vol. 59, no. 3,
pp. 563–575, Sep. 2010.
20. C. Coello Coello, G. Lamont, and D. Veldhuizen, Evolutionary Algorithms for Solving
Multi-Objective Problems (Genetic and Evolutionary Computation Series). Berlin,
Germany: Springer, 2007.
21. I. Ahmad, A. B. Abdulah, A. S. Alghamdi, K. Alnfajan, M. Hussain, Feature Subset
Selection for Network Intrusion Detection Mechanism Using Genetic Eigen Vectors,
Proc. of CSIT, vol. 5, 2011.
22. Saman M. Abdulla, Najla B. Al-Dabagh, Omar Zakaria, Identify Features and Parameters
to Devise an Accurate Intrusion Detection System Using Artificial Neural Network,
World Academy of Science, Engineering and Technology, 2010.
23. "NSL-KDD dataset for network-based intrusion detection systems", available at
http://iscx.info/NSL-KDD/.
24. http://www.cs.waikato.ac.nz/~ml/weka/.
25. Gianluigi Folino, Clara Pizzuti, Giandomenico Spezzano, An ensemble-based
evolutionary framework for coping with distributed intrusion detection, Genetic
Programming and Evolvable Machines 11, 131–146, 2010.
26. Shelly Xiaonan Wu, Wolfgang Banzhaf, The use of computational intelligence in
intrusion detection systems: A review, Applied Soft Computing 10, 1–35, 2010.
27. Ahmad Taher Azar, Hanaa Ismail Elshazly, Aboul Ella Hassanien, Abeer Mohamed
Elkorany, A random forest classifier for lymph diseases, Computer Methods and
Programs in Biomedicine, 2013.
28. P. Ravisankar, V. Ravi, G. Raghava Rao, I. Bose, Detection of financial statement fraud
and feature selection using data mining techniques, Decision Support Systems 50,
491–500, 2011.
59

Weitere ähnliche Inhalte

Was ist angesagt?

Query Aware Determinization of Uncertain Objects
Query Aware Determinization of Uncertain ObjectsQuery Aware Determinization of Uncertain Objects
Query Aware Determinization of Uncertain Objects1crore projects
 
copy for Gary Chin.
copy for Gary Chin.copy for Gary Chin.
copy for Gary Chin.Teng Xiaolu
 
LINK MINING PROCESS
LINK MINING PROCESSLINK MINING PROCESS
LINK MINING PROCESSIJDKP
 
Anomaly detection via eliminating data redundancy and rectifying data error i...
Anomaly detection via eliminating data redundancy and rectifying data error i...Anomaly detection via eliminating data redundancy and rectifying data error i...
Anomaly detection via eliminating data redundancy and rectifying data error i...nalini manogaran
 
Associative Regressive Decision Rule Mining for Predicting Customer Satisfact...
Associative Regressive Decision Rule Mining for Predicting Customer Satisfact...Associative Regressive Decision Rule Mining for Predicting Customer Satisfact...
Associative Regressive Decision Rule Mining for Predicting Customer Satisfact...csandit
 
Selecting the correct Data Mining Method: Classification & InDaMiTe-R
Selecting the correct Data Mining Method: Classification & InDaMiTe-RSelecting the correct Data Mining Method: Classification & InDaMiTe-R
Selecting the correct Data Mining Method: Classification & InDaMiTe-RIOSR Journals
 
A statistical data fusion technique in virtual data integration environment
A statistical data fusion technique in virtual data integration environmentA statistical data fusion technique in virtual data integration environment
A statistical data fusion technique in virtual data integration environmentIJDKP
 
Probabilistic Interestingness Measures - An Introduction with Bayesian Belief...
Probabilistic Interestingness Measures - An Introduction with Bayesian Belief...Probabilistic Interestingness Measures - An Introduction with Bayesian Belief...
Probabilistic Interestingness Measures - An Introduction with Bayesian Belief...Adnan Masood
 
MOVIE SUCCESS PREDICTION AND PERFORMANCE COMPARISON USING VARIOUS STATISTICAL...
MOVIE SUCCESS PREDICTION AND PERFORMANCE COMPARISON USING VARIOUS STATISTICAL...MOVIE SUCCESS PREDICTION AND PERFORMANCE COMPARISON USING VARIOUS STATISTICAL...
MOVIE SUCCESS PREDICTION AND PERFORMANCE COMPARISON USING VARIOUS STATISTICAL...ijaia
 
Clustering Prediction Techniques in Defining and Predicting Customers Defecti...
Clustering Prediction Techniques in Defining and Predicting Customers Defecti...Clustering Prediction Techniques in Defining and Predicting Customers Defecti...
Clustering Prediction Techniques in Defining and Predicting Customers Defecti...IJECEIAES
 
First Year Report, PhD presentation
First Year Report, PhD presentationFirst Year Report, PhD presentation
First Year Report, PhD presentationBang Xiang Yong
 
IRJET- Analyzing Voting Results using Influence Matrix
IRJET- Analyzing Voting Results using Influence MatrixIRJET- Analyzing Voting Results using Influence Matrix
IRJET- Analyzing Voting Results using Influence MatrixIRJET Journal
 
Opinion mining framework using proposed RB-bayes model for text classication
Opinion mining framework using proposed RB-bayes model for text classicationOpinion mining framework using proposed RB-bayes model for text classication
Opinion mining framework using proposed RB-bayes model for text classicationIJECEIAES
 
Survey on Software Data Reduction Techniques Accomplishing Bug Triage
Survey on Software Data Reduction Techniques Accomplishing Bug TriageSurvey on Software Data Reduction Techniques Accomplishing Bug Triage
Survey on Software Data Reduction Techniques Accomplishing Bug TriageIRJET Journal
 
International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)IJERD Editor
 
Prediction of Euro 50 Using Back Propagation Neural Network (BPNN) and Geneti...
Prediction of Euro 50 Using Back Propagation Neural Network (BPNN) and Geneti...Prediction of Euro 50 Using Back Propagation Neural Network (BPNN) and Geneti...
Prediction of Euro 50 Using Back Propagation Neural Network (BPNN) and Geneti...AI Publications
 
Web Based Fuzzy Clustering Analysis
Web Based Fuzzy Clustering AnalysisWeb Based Fuzzy Clustering Analysis
Web Based Fuzzy Clustering Analysisinventy
 
Bayesian Networks and Association Analysis
Bayesian Networks and Association AnalysisBayesian Networks and Association Analysis
Bayesian Networks and Association AnalysisAdnan Masood
 

Was ist angesagt? (19)

Query Aware Determinization of Uncertain Objects
Query Aware Determinization of Uncertain ObjectsQuery Aware Determinization of Uncertain Objects
Query Aware Determinization of Uncertain Objects
 
copy for Gary Chin.
copy for Gary Chin.copy for Gary Chin.
copy for Gary Chin.
 
LINK MINING PROCESS
LINK MINING PROCESSLINK MINING PROCESS
LINK MINING PROCESS
 
Anomaly detection via eliminating data redundancy and rectifying data error i...
Anomaly detection via eliminating data redundancy and rectifying data error i...Anomaly detection via eliminating data redundancy and rectifying data error i...
Anomaly detection via eliminating data redundancy and rectifying data error i...
 
Associative Regressive Decision Rule Mining for Predicting Customer Satisfact...
Associative Regressive Decision Rule Mining for Predicting Customer Satisfact...Associative Regressive Decision Rule Mining for Predicting Customer Satisfact...
Associative Regressive Decision Rule Mining for Predicting Customer Satisfact...
 
Selecting the correct Data Mining Method: Classification & InDaMiTe-R
Selecting the correct Data Mining Method: Classification & InDaMiTe-RSelecting the correct Data Mining Method: Classification & InDaMiTe-R
Selecting the correct Data Mining Method: Classification & InDaMiTe-R
 
A statistical data fusion technique in virtual data integration environment
A statistical data fusion technique in virtual data integration environmentA statistical data fusion technique in virtual data integration environment
A statistical data fusion technique in virtual data integration environment
 
Probabilistic Interestingness Measures - An Introduction with Bayesian Belief...
Probabilistic Interestingness Measures - An Introduction with Bayesian Belief...Probabilistic Interestingness Measures - An Introduction with Bayesian Belief...
Probabilistic Interestingness Measures - An Introduction with Bayesian Belief...
 
MOVIE SUCCESS PREDICTION AND PERFORMANCE COMPARISON USING VARIOUS STATISTICAL...
MOVIE SUCCESS PREDICTION AND PERFORMANCE COMPARISON USING VARIOUS STATISTICAL...MOVIE SUCCESS PREDICTION AND PERFORMANCE COMPARISON USING VARIOUS STATISTICAL...
MOVIE SUCCESS PREDICTION AND PERFORMANCE COMPARISON USING VARIOUS STATISTICAL...
 
Clustering Prediction Techniques in Defining and Predicting Customers Defecti...
Clustering Prediction Techniques in Defining and Predicting Customers Defecti...Clustering Prediction Techniques in Defining and Predicting Customers Defecti...
Clustering Prediction Techniques in Defining and Predicting Customers Defecti...
 
First Year Report, PhD presentation
First Year Report, PhD presentationFirst Year Report, PhD presentation
First Year Report, PhD presentation
 
IRJET- Analyzing Voting Results using Influence Matrix
IRJET- Analyzing Voting Results using Influence MatrixIRJET- Analyzing Voting Results using Influence Matrix
IRJET- Analyzing Voting Results using Influence Matrix
 
Opinion mining framework using proposed RB-bayes model for text classication
Opinion mining framework using proposed RB-bayes model for text classicationOpinion mining framework using proposed RB-bayes model for text classication
Opinion mining framework using proposed RB-bayes model for text classication
 
Survey on Software Data Reduction Techniques Accomplishing Bug Triage
Survey on Software Data Reduction Techniques Accomplishing Bug TriageSurvey on Software Data Reduction Techniques Accomplishing Bug Triage
Survey on Software Data Reduction Techniques Accomplishing Bug Triage
 
International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)
 
Prediction of Euro 50 Using Back Propagation Neural Network (BPNN) and Geneti...
Prediction of Euro 50 Using Back Propagation Neural Network (BPNN) and Geneti...Prediction of Euro 50 Using Back Propagation Neural Network (BPNN) and Geneti...
Prediction of Euro 50 Using Back Propagation Neural Network (BPNN) and Geneti...
 
Web Based Fuzzy Clustering Analysis
Web Based Fuzzy Clustering AnalysisWeb Based Fuzzy Clustering Analysis
Web Based Fuzzy Clustering Analysis
 
Bayesian Networks and Association Analysis
Bayesian Networks and Association AnalysisBayesian Networks and Association Analysis
Bayesian Networks and Association Analysis
 
final
finalfinal
final
 

Andere mochten auch

BKK16-408B Data Analytics and Machine Learning From Node to Cluster
BKK16-408B Data Analytics and Machine Learning From Node to ClusterBKK16-408B Data Analytics and Machine Learning From Node to Cluster
BKK16-408B Data Analytics and Machine Learning From Node to ClusterLinaro
 
Arraigado en ti
Arraigado en tiArraigado en ti
Arraigado en tijumelmc
 
REP093 Non timber forest products and SFM Tabalong (2), J Pa
REP093 Non timber forest products and SFM Tabalong (2), J PaREP093 Non timber forest products and SFM Tabalong (2), J Pa
REP093 Non timber forest products and SFM Tabalong (2), J PaJunaidi Payne
 
Algunas lecciones aprendidas en la transformación de una compañía aérea
Algunas lecciones aprendidas en la transformación de una compañía aéreaAlgunas lecciones aprendidas en la transformación de una compañía aérea
Algunas lecciones aprendidas en la transformación de una compañía aéreaAgilar
 
Arquivologia para concursos 2ª ed. renato valentini
Arquivologia para concursos 2ª ed. renato valentiniArquivologia para concursos 2ª ed. renato valentini
Arquivologia para concursos 2ª ed. renato valentiniBob Marlei
 
Procesos y tecnologias de las telecomunicaciones - Mora Marabi
Procesos y tecnologias de las telecomunicaciones - Mora MarabiProcesos y tecnologias de las telecomunicaciones - Mora Marabi
Procesos y tecnologias de las telecomunicaciones - Mora MarabiMora Marabi
 

Andere mochten auch (11)

BKK16-408B Data Analytics and Machine Learning From Node to Cluster
BKK16-408B Data Analytics and Machine Learning From Node to ClusterBKK16-408B Data Analytics and Machine Learning From Node to Cluster
BKK16-408B Data Analytics and Machine Learning From Node to Cluster
 
Arraigado en ti
Arraigado en tiArraigado en ti
Arraigado en ti
 
Tutorial
TutorialTutorial
Tutorial
 
tupac amaru
tupac amarutupac amaru
tupac amaru
 
Planeacion tics
Planeacion ticsPlaneacion tics
Planeacion tics
 
REP093 Non timber forest products and SFM Tabalong (2), J Pa
REP093 Non timber forest products and SFM Tabalong (2), J PaREP093 Non timber forest products and SFM Tabalong (2), J Pa
REP093 Non timber forest products and SFM Tabalong (2), J Pa
 
Algunas lecciones aprendidas en la transformación de una compañía aérea
Algunas lecciones aprendidas en la transformación de una compañía aéreaAlgunas lecciones aprendidas en la transformación de una compañía aérea
Algunas lecciones aprendidas en la transformación de una compañía aérea
 
La gravitazione
La gravitazioneLa gravitazione
La gravitazione
 
Caracteristicas de la natación (1)
Caracteristicas de la natación (1)Caracteristicas de la natación (1)
Caracteristicas de la natación (1)
 
Arquivologia para concursos 2ª ed. renato valentini
Arquivologia para concursos 2ª ed. renato valentiniArquivologia para concursos 2ª ed. renato valentini
Arquivologia para concursos 2ª ed. renato valentini
 
Procesos y tecnologias de las telecomunicaciones - Mora Marabi
Procesos y tecnologias de las telecomunicaciones - Mora MarabiProcesos y tecnologias de las telecomunicaciones - Mora Marabi
Procesos y tecnologias de las telecomunicaciones - Mora Marabi
 

Ähnlich wie Final Report

churn_detection.pptx
churn_detection.pptxchurn_detection.pptx
churn_detection.pptxDhanuDhanu49
 
IRJET- Analysis of Rating Difference and User Interest
IRJET- Analysis of Rating Difference and User InterestIRJET- Analysis of Rating Difference and User Interest
IRJET- Analysis of Rating Difference and User InterestIRJET Journal
 
Review of Algorithms for Crime Analysis & Prediction
Review of Algorithms for Crime Analysis & PredictionReview of Algorithms for Crime Analysis & Prediction
Review of Algorithms for Crime Analysis & PredictionIRJET Journal
 
Predictive job scheduling in a connection limited system using parallel genet...
Predictive job scheduling in a connection limited system using parallel genet...Predictive job scheduling in a connection limited system using parallel genet...
Predictive job scheduling in a connection limited system using parallel genet...Mumbai Academisc
 
rpaper
rpaperrpaper
rpaperimu409
 
Review on Algorithmic and Non Algorithmic Software Cost Estimation Techniques
Review on Algorithmic and Non Algorithmic Software Cost Estimation TechniquesReview on Algorithmic and Non Algorithmic Software Cost Estimation Techniques
Review on Algorithmic and Non Algorithmic Software Cost Estimation Techniquesijtsrd
 
Feature Subset Selection for High Dimensional Data using Clustering Techniques
Feature Subset Selection for High Dimensional Data using Clustering TechniquesFeature Subset Selection for High Dimensional Data using Clustering Techniques
Feature Subset Selection for High Dimensional Data using Clustering TechniquesIRJET Journal
 
A Hierarchical Feature Set optimization for effective code change based Defec...
A Hierarchical Feature Set optimization for effective code change based Defec...A Hierarchical Feature Set optimization for effective code change based Defec...
A Hierarchical Feature Set optimization for effective code change based Defec...IOSR Journals
 
E-Healthcare monitoring System for diagnosis of Heart Disease using Machine L...
E-Healthcare monitoring System for diagnosis of Heart Disease using Machine L...E-Healthcare monitoring System for diagnosis of Heart Disease using Machine L...
E-Healthcare monitoring System for diagnosis of Heart Disease using Machine L...IRJET Journal
 
Performance Comparision of Machine Learning Algorithms
Performance Comparision of Machine Learning AlgorithmsPerformance Comparision of Machine Learning Algorithms
Performance Comparision of Machine Learning AlgorithmsDinusha Dilanka
 
Cyb 5675 class project final
Cyb 5675   class project finalCyb 5675   class project final
Cyb 5675 class project finalCraig Cannon
 
CLASSIFIER SELECTION MODELS FOR INTRUSION DETECTION SYSTEM (IDS)
CLASSIFIER SELECTION MODELS FOR INTRUSION DETECTION SYSTEM (IDS)CLASSIFIER SELECTION MODELS FOR INTRUSION DETECTION SYSTEM (IDS)
CLASSIFIER SELECTION MODELS FOR INTRUSION DETECTION SYSTEM (IDS)ieijjournal1
 
Decision treeinductionmethodsandtheirapplicationtobigdatafinal 5
Decision treeinductionmethodsandtheirapplicationtobigdatafinal 5Decision treeinductionmethodsandtheirapplicationtobigdatafinal 5
Decision treeinductionmethodsandtheirapplicationtobigdatafinal 5ssuser33da69
 
High performance intrusion detection using modified k mean &amp; naïve bayes
High performance intrusion detection using modified k mean &amp; naïve bayesHigh performance intrusion detection using modified k mean &amp; naïve bayes
High performance intrusion detection using modified k mean &amp; naïve bayeseSAT Journals
 
High performance intrusion detection using modified k mean &amp; naïve bayes
High performance intrusion detection using modified k mean &amp; naïve bayesHigh performance intrusion detection using modified k mean &amp; naïve bayes
High performance intrusion detection using modified k mean &amp; naïve bayeseSAT Journals
 
A study and survey on various progressive duplicate detection mechanisms
A study and survey on various progressive duplicate detection mechanismsA study and survey on various progressive duplicate detection mechanisms
A study and survey on various progressive duplicate detection mechanismseSAT Journals
 
An efficient feature selection algorithm for health care data analysis
An efficient feature selection algorithm for health care data analysisAn efficient feature selection algorithm for health care data analysis
An efficient feature selection algorithm for health care data analysisjournalBEEI
 

Ähnlich wie Final Report (20)

churn_detection.pptx
churn_detection.pptxchurn_detection.pptx
churn_detection.pptx
 
IRJET- Analysis of Rating Difference and User Interest
IRJET- Analysis of Rating Difference and User InterestIRJET- Analysis of Rating Difference and User Interest
IRJET- Analysis of Rating Difference and User Interest
 
Review of Algorithms for Crime Analysis & Prediction
Review of Algorithms for Crime Analysis & PredictionReview of Algorithms for Crime Analysis & Prediction
Review of Algorithms for Crime Analysis & Prediction
 
Predictive job scheduling in a connection limited system using parallel genet...
Predictive job scheduling in a connection limited system using parallel genet...Predictive job scheduling in a connection limited system using parallel genet...
Predictive job scheduling in a connection limited system using parallel genet...
 
rpaper
rpaperrpaper
rpaper
 
Review on Algorithmic and Non Algorithmic Software Cost Estimation Techniques
Review on Algorithmic and Non Algorithmic Software Cost Estimation TechniquesReview on Algorithmic and Non Algorithmic Software Cost Estimation Techniques
Review on Algorithmic and Non Algorithmic Software Cost Estimation Techniques
 
Feature Subset Selection for High Dimensional Data using Clustering Techniques
Feature Subset Selection for High Dimensional Data using Clustering TechniquesFeature Subset Selection for High Dimensional Data using Clustering Techniques
Feature Subset Selection for High Dimensional Data using Clustering Techniques
 
A Hierarchical Feature Set optimization for effective code change based Defec...
A Hierarchical Feature Set optimization for effective code change based Defec...A Hierarchical Feature Set optimization for effective code change based Defec...
A Hierarchical Feature Set optimization for effective code change based Defec...
 
E-Healthcare monitoring System for diagnosis of Heart Disease using Machine L...
E-Healthcare monitoring System for diagnosis of Heart Disease using Machine L...E-Healthcare monitoring System for diagnosis of Heart Disease using Machine L...
E-Healthcare monitoring System for diagnosis of Heart Disease using Machine L...
 
Performance Comparision of Machine Learning Algorithms
Performance Comparision of Machine Learning AlgorithmsPerformance Comparision of Machine Learning Algorithms
Performance Comparision of Machine Learning Algorithms
 
Cyb 5675 class project final
Cyb 5675   class project finalCyb 5675   class project final
Cyb 5675 class project final
 
CLASSIFIER SELECTION MODELS FOR INTRUSION DETECTION SYSTEM (IDS)
CLASSIFIER SELECTION MODELS FOR INTRUSION DETECTION SYSTEM (IDS)CLASSIFIER SELECTION MODELS FOR INTRUSION DETECTION SYSTEM (IDS)
CLASSIFIER SELECTION MODELS FOR INTRUSION DETECTION SYSTEM (IDS)
 
Decision treeinductionmethodsandtheirapplicationtobigdatafinal 5
Decision treeinductionmethodsandtheirapplicationtobigdatafinal 5Decision treeinductionmethodsandtheirapplicationtobigdatafinal 5
Decision treeinductionmethodsandtheirapplicationtobigdatafinal 5
 
Ieee doctoral progarm final
Ieee doctoral progarm finalIeee doctoral progarm final
Ieee doctoral progarm final
 
High performance intrusion detection using modified k mean &amp; naïve bayes
High performance intrusion detection using modified k mean &amp; naïve bayesHigh performance intrusion detection using modified k mean &amp; naïve bayes
High performance intrusion detection using modified k mean &amp; naïve bayes
 
High performance intrusion detection using modified k mean &amp; naïve bayes
High performance intrusion detection using modified k mean &amp; naïve bayesHigh performance intrusion detection using modified k mean &amp; naïve bayes
High performance intrusion detection using modified k mean &amp; naïve bayes
 
A study and survey on various progressive duplicate detection mechanisms
A study and survey on various progressive duplicate detection mechanismsA study and survey on various progressive duplicate detection mechanisms
A study and survey on various progressive duplicate detection mechanisms
 
Dissertation
DissertationDissertation
Dissertation
 
Final proj 2 (1)
Final proj 2 (1)Final proj 2 (1)
Final proj 2 (1)
 
An efficient feature selection algorithm for health care data analysis
An efficient feature selection algorithm for health care data analysisAn efficient feature selection algorithm for health care data analysis
An efficient feature selection algorithm for health care data analysis
 

Final Report

  • 1. 1 CHAPTER 1 1.1 AIM: Optimization And Combination Of K Nearest Neighbours For Intrusion Detection System through Genetic Programming using KDD CUP 1999 dataset. 1.2 OBJECTIVE:  Can GP based numeric classifier show optimized performance than individual NN classifiers?  Can GP based combination technique produce a higher performance OCC as compared to NN component classifiers?  Can heterogeneous combination of NN classifiers produce more promising results as compared to their homogenous ones. 1.3 SCOPE: The scope of this undertaking is wide. It can be utilized not merely for safeguarding web arrangement but can additionally be utilized for noticing frauds in the banks and supplementary ecommerce business. The arrangement projected in this paper will be utilized for supplementary setbacks by employing disparate datasets. Like health data, iris data and so on. 1.4 STRUCTURE: In the pursuing, a description is endowed on how the dissertation is coordinated in understanding the believed of genetic software design and how it can be utilized for optimization and combination of K nearest acquaintances for intrusion detection system. Early chapter includes data concerning the Aim, Scope, and Methodologies. Chapter two provides the introduction concerning genetic programming. In chapter three we debate the works survey. Chapter four gives the methodologies utilized and Fifth chapter is established on conclusion and references. 1.5 Methodology: In this serving we will debate the assorted steps that will be seized for the progress of our project. There will be two periods in development. In early period we will focus on the progress of the optimal KNN classifier employing fitness of individual. In the subsequent period we will focus on the combination of the optimal KNN classifier on the basis of ROC bends obtained.
  • 3. 3 CHAPTER 2 2.1 Abstract: From the onset of web arrangement, protection menaces normally recognized as intrusions have come to be extremely vital and critical subject in web arrangements, data and data system. In order to vanquish these menaces every single period a detection arrangement was demanded because of drastic development in networks. Because of the development of arrangement, attackers came to be stronger and every single period compromises the protection of system. Hence a demand of Intrusion Detection arrangement came to be extremely vital and vital instrument in web security. Detection and prevention of such aggressions shouted intrusions generally depends on the skill and efficiency of Intrusion Detection Arrangement (IDS).As alongside rise in web scalability alongside elevated pace, The demand for light heaviness Intrusion Detection Arrangement alongside elevated detection ratio is a necessity .Therefore countless ensemble mechanism has been counselled by employing countless methodologies’, these methodologies have their own benefits and short comings. In this paper we will focus on ensemble of classifiers employing genetic programming. We will debate how genetic software design is a good way for joining constituent classifiers; next we will debate how to accomplish the highest presentation measures. 2.2 INTRODUCTION: With the advancement in technologies and quick development of webs, there is a demand of becoming functional data from a large volume of data. In period like company, health, engineering and supplementary fields there is a demand for seizing data from assorted input origins and next categorizing them precisely in their corresponding classes. In health, manufacturing, logical and business requests the data generated is becoming extremely convoluted and tough for the humans to make sense out of it. Countless ways are gave by scientists, for computers not to merely produce data but are trying to make computers comprehend concerning the data .The most methods for becoming functional data out of large data are data excavating and vision discovery. The main target of practioners is to design a elevated presentation forecast model. Countless methods have been counseled for data classification. Statistical methods proved functional instruments for data analysing of association problems. Moreover, a collection of supplementary methods have been counseled like expert arrangements, manmade neural webs, furry logic, DT and evolutionary
computing. These are more capable in many application domains than statistical techniques. Intelligent methods support decision making, offer better performance, and can easily be tuned to meet the requirements of a decision; in many cases they are more accurate than human experts. Considerable effort is therefore being devoted to applying intelligent techniques. Usually in classification we have no prior knowledge about the input data or about the representation of the final prediction model, and these issues create difficulties for intelligent methods that could otherwise optimize the performance of many classification models. Genetic programming (GP) techniques have been applied in a growing number of pattern-classification applications: GP has proved effective in object detection, segmentation, feature extraction and classification. The key to the success of intelligent methods is optimization, yet complete methods for optimizing classification models are lacking. GP models are flexible and comparatively general, and they provide a high learning capability; these features help in building classification models, and GP has successfully optimized various classification models. Genetic programming is a branch of evolutionary computation in which the solution to a problem is represented as a flexible computer program; this representation allows GP to provide an efficient solution to many problems. GP-based search offers new optimization methods in classification: models such as decision-tree classifiers and classification rule sets can be developed using GP. GP also gives rise to a new form of classifier called the numeric classifier, which consists of arithmetic and logical expressions and can be combined with conditional statements. GP-based numeric classifiers have been applied to problems such as disease diagnosis, intrusion detection, pattern matching and business applications. GP is a simple, flexible and powerful technique that has been widely used for optimization problems, e.g. model induction and automatic programming.
Various classification models have been developed for specific problems. Because these application-specific models often perform poorly, a GP-based ensemble method is adopted in this paper to enhance the performance of different classification models. In many applications, individual classifiers are combined to obtain the maximum benefit: a combination of two or more models is more capable than a single one, since the shortcoming of one classifier can easily be compensated by the strength of another. The accuracy and diversity of the individual classifiers help in evolving an optimal composite classifier.
Several base classifiers are created using learning algorithms, and an optimal classifier is obtained by combining the outputs of the component classifiers through a combination scheme. Two issues arise with such methodologies:
1. How to create suitable individual classifiers.
2. How best to combine them.
These issues are addressed in this paper. GP addresses them by evolving functions through an evaluation process. A heuristic-based search method may fail to find an optimal solution because of the complexity of searching a huge space; GP offers a practical solution to this problem and provides a better combination of classifiers, known as numeric classifiers. In this paper, nearest neighbour classifiers are combined to develop an optimal classifier for an intrusion detection system. GP-based methods can efficiently combine the outputs of individual classifiers into an optimal numeric classifier. The method manipulates a population of candidate solutions, and the main focus of GP is the search space that contains the optimal solution. GP exploits its inherent qualities of adaptation, flexibility and generalization: adaptation helps to monitor the classifier combination and refine performance as conditions change; flexibility helps it work robustly with incomplete or inconsistent data; and generalization helps in evolving GP-based classifiers that can make decisions on unseen data examples. The receiver operating characteristic (ROC) curve is used to measure classifier performance. Because a classifier may perform poorly under overlapping class distributions, the area under the convex hull (AUCH) of the ROC curve is used to select the optimal numeric classifier. In this paper the ROC curve serves as the GP fitness function and is also used for classifier comparison; a larger ROC curve for the composite classifier leads to better decisions.
2.3 Numeric expression classifier:
A numeric expression is any function that returns a numeric value. The function can contain any mathematical operator, i.e. arithmetic, logarithmic and other operators. The classifier takes the problem variables as input and produces an output in the form of a numeric value.
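To make the idea concrete, the following is a minimal sketch (not the classifier evolved in this work) of how a numeric expression classifier can be realised: an arithmetic expression over the input features returns a number, and a threshold turns that number into a class label. The particular expression, feature indices and threshold are illustrative assumptions only.

```python
import math

# A minimal sketch of a numeric expression classifier: a GP-evolved arithmetic
# expression maps a feature vector to a single number, and a threshold turns
# that number into a class label. The expression below is only an example of
# the kind of function a GP run might produce.

def numeric_expression(x):
    """Arbitrary mix of arithmetic and logarithmic operators over the features."""
    # x is a list of numeric feature values; the indices are illustrative only.
    return 0.8 * x[0] - math.log(1.0 + abs(x[1])) + x[2] * x[3]

def classify(x, threshold=0.0):
    """Scores at or above the threshold are labelled C1, the rest C2."""
    return "C1" if numeric_expression(x) >= threshold else "C2"

if __name__ == "__main__":
    sample = [1.2, 0.5, 3.0, -0.7]
    print(numeric_expression(sample))  # raw numeric output
    print(classify(sample))            # class at the default threshold
```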
2.4 GP OPERATORS:
Crossover: the crossover operator is the most frequently used operator. It merges genetic material from two parent trees and produces two offspring trees. In each parent a point is randomly selected as the crossover point.
Mutation: mutation makes a random change to a learning tree. An individual is selected for mutation using fitness-proportional selection.
Reproduction: a copy of an individual is simply passed into the next generation; no change is made to the program.
Fig 2.1 Block diagram of the genetic programming process (generate the initial population, evaluate the fitness of each individual, select individuals based on fitness, apply reproduction, crossover or mutation according to the operator probabilities, insert the offspring into the new population, and repeat until the termination criterion is satisfied; the best result is then saved).
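The sketch below illustrates these three operators on small expression trees. The tuple-based tree representation, the terminal and operator sets, and the helper functions are assumptions made for illustration; they are not taken from the system described here.

```python
import random

# A minimal sketch of the three GP operators on expression trees, assuming a
# tree is either a terminal (feature name or constant) or a tuple
# (operator, left_subtree, right_subtree).

TERMINALS = ["x0", "x1", "x2", 1.0, 2.0]
OPERATORS = ["+", "-", "*"]

def random_tree(depth=2):
    if depth == 0 or random.random() < 0.3:
        return random.choice(TERMINALS)
    return (random.choice(OPERATORS), random_tree(depth - 1), random_tree(depth - 1))

def all_positions(tree, path=()):
    """Enumerate every node position as a path of child indices."""
    yield path
    if isinstance(tree, tuple):
        yield from all_positions(tree[1], path + (1,))
        yield from all_positions(tree[2], path + (2,))

def get_subtree(tree, path):
    for i in path:
        tree = tree[i]
    return tree

def replace_subtree(tree, path, new):
    if not path:
        return new
    parts = list(tree)
    parts[path[0]] = replace_subtree(parts[path[0]], path[1:], new)
    return tuple(parts)

def crossover(parent_a, parent_b):
    """Swap randomly chosen subtrees of the two parents, giving two offspring."""
    pa = random.choice(list(all_positions(parent_a)))
    pb = random.choice(list(all_positions(parent_b)))
    sa, sb = get_subtree(parent_a, pa), get_subtree(parent_b, pb)
    return replace_subtree(parent_a, pa, sb), replace_subtree(parent_b, pb, sa)

def mutate(tree):
    """Replace one randomly chosen subtree with a freshly generated one."""
    p = random.choice(list(all_positions(tree)))
    return replace_subtree(tree, p, random_tree())

def reproduce(tree):
    """Reproduction simply copies the individual unchanged."""
    return tree

if __name__ == "__main__":
    a, b = random_tree(), random_tree()
    print(crossover(a, b))
    print(mutate(a))
```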
2.5 Genetic Programming for Combining Classifiers:
The main idea behind combining classifiers is to achieve better accuracy than individual classifiers. The data mining and knowledge discovery literature suggests that combining individual classifiers into a meta-classifier frequently improves accuracy, and many such combination methods merge several classifiers to produce a single classifier of higher performance. Unstable learning algorithms, i.e. those whose output changes in response to a small change in the training sample, are particularly suitable for ensembling. Hansen and Salamon showed that a combination of classifiers is more accurate than its members only if the individual classifiers are accurate and diverse. They proved that if the component classifiers are independent and their error rates are below 0.5, the error rate of the combined classifier decreases as the number of individual classifiers increases. From this result we learn that the accuracy and diversity of the individual classifiers are of prime importance. A large number of combinations have been proposed in the literature; they are divided into serial and parallel combinations. Parallel combination is the most widely used: it consists of a set of component classifiers and a combining algorithm that merges the outputs of the individual classifiers into a final classifier, as shown in Fig 2.2.
Fig 2.2 The most commonly used parallel combination of classifiers (an input pattern x is fed to classifiers C1, C2, ..., Ck, whose outputs are merged by the combination stage).
There are two approaches to combining component classifiers:
1. Generative: first create the base classifiers and then ensemble them. Boosting, bagging and mixture of experts are generative techniques.
2. Non-generative: combine independently trained classifiers; the combination depends only on the outputs supplied by the component classifiers.
Our GP-based technique will be non-generative and will combine information at the measurement level.
2.6 Categorization of combinations of classifiers:
Dietterich described many combination methods based on machine learning. Sharkey pointed out that the limiting factor in combining classifiers is the lack of a full understanding of the available modular structures, because there is little agreement on how to describe and classify the various classes of classifiers. A comprehensive categorization of classifier ensembles is:
1. Voting classifier ensembles
2. Classifier ensembles by manipulating training samples
3. Homogeneous classifier ensembles
4. Recursive partition ensembles
5. Heterogeneous classifier ensembles
Voting classifier ensembles: the three main categories are:
1. Simple voting: each individual classifier casts an equally weighted vote, and the input is assigned to the class that receives the majority of votes.
2. Weighted voting: each vote receives a weight proportional to the estimated generalization performance of the corresponding classifier. This scheme performs better than simple voting.
3. The weighted majority algorithm: similar to weighted voting, differing only in how the weights are generated.
2.7 Classifier ensembles by manipulating training samples:
In these approaches the learning algorithm is run many times, each time on a different partition of the training sample. Boosting and bagging are the two most successful examples.
Bagging: a simple method for building classifier ensembles, abbreviated from bootstrap aggregating. Different datasets are first created by sampling the training data at random with replacement, a machine learning method is used to train a classifier on each of these training sets, and the resulting classifiers are then applied to the test data using a voting scheme. Because the learning method produces a different classifier for each dataset, these classifiers are combined.
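As an illustration of bagging combined with simple voting, the following sketch bootstrap-samples a toy dataset and lets the resulting component classifiers vote. The 1-nearest-neighbour base learner and the toy data are assumptions chosen only to keep the example self-contained.

```python
import random
from collections import Counter

# A minimal sketch of bagging with simple majority voting, not the ensemble
# construction used in the paper.

def train_1nn(train):
    """The 'model' of a 1-NN learner is just its training sample."""
    def predict(x):
        nearest = min(train, key=lambda p: sum((a - b) ** 2 for a, b in zip(p[0], x)))
        return nearest[1]
    return predict

def bagging(train, n_models=5):
    """Bootstrap-sample the training data and train one base learner per sample."""
    models = []
    for _ in range(n_models):
        sample = [random.choice(train) for _ in range(len(train))]  # with replacement
        models.append(train_1nn(sample))
    return models

def majority_vote(models, x):
    """Each component classifier casts one equally weighted vote."""
    votes = Counter(m(x) for m in models)
    return votes.most_common(1)[0][0]

if __name__ == "__main__":
    # toy data: (features, label) with label 'normal' or 'attack'
    data = [((0.1, 0.2), "normal"), ((0.2, 0.1), "normal"),
            ((0.9, 0.8), "attack"), ((0.8, 0.9), "attack")]
    ensemble = bagging(data, n_models=7)
    print(majority_vote(ensemble, (0.85, 0.75)))  # expected: attack
```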
Boosting: proposed by Schapire. In this algorithm any weak learning algorithm can be boosted into a strong one, based on a theoretical model called the probably approximately correct (PAC) weak learning model. Boosting attempts to combine the classifiers trained for a given problem into a more general one, and the whole dataset is used for deriving each classifier.
Homogeneous classifier ensembles: only one specific type of classifier is combined in the ensemble.
Recursive partition classifier ensembles: a divide-and-conquer strategy is used to partition the space into subsets of instances of only one class. This model can be used to combine decision trees, linear discriminant functions and instance-based ensembles.
Heterogeneous classifier ensembles: meta-learning and stacked generalization are the two most used methods. In this approach classifiers of different types are ensembled for higher accuracy.
2.8 GP in Combining Classifiers:
1. Generating constituent classifiers: much research has been carried out to exploit the potential strength of GP for improving classification results. Classifiers are generated by bagging and boosting techniques, which train several classifiers on different samples of the training data; the trained classifiers are then combined to form a single classifier that can improve the GP results.
2. Generating decision trees: to create the decision tree structure, the decision conditions used by conventional decision tree builders are replaced at each inner node by a GP-trained numeric expression. A simple, relatively inaccurate expression is trained at the root node, and a data example is passed to one of the two child branches depending on the classification at the root. At each node a GP-evolved expression is trained to classify the data again. Combining the decision tree structure with GP expressions in this way yields higher accuracy.
3. Combining classifiers: Langdon et al. contributed substantially to the development of composite classifiers using GP. Their main motive was to trade off the false positive rate (FPR) against the true positive rate (TPR) to produce a highly optimized ROC curve. Various classifiers such as neural networks, decision trees and naive Bayes were combined.
4. Architecture of the GP-based combination technique:
a. The first layer is the same as that of stacked generalization.
b. The outputs of all the individual classifiers are combined to form a new derived training dataset, and composite classifiers are developed from it using genetic programming.
c. The GP-based combined classifiers use the threshold T as a variable for computing the AUCH of the ROC curve.
Fig 2.3 The architecture for constructing a composite classifier (the input data and a set of suitable component classifiers feed a GP simulation cycle, which evolves the OCC as a function of C1(T), C2(T), ..., Cn(T), with T in [0, 1] and other random variables).
2.9 Computing the Prediction of Component Classifiers:
There are two general approaches to computing predictions:
1. Recursive partitioning of the input data space: the input data is recursively partitioned into subspaces by the constituent classifiers. The subspaces of the final partition determine the predictions for all instances falling in them. These algorithms are applied through decision trees and use a divide-and-conquer approach.
2. Global data scope: each constituent classifier makes a prediction for every input instance, and the predictions are then merged by a combining scheme. Every classifier is evaluated on all instances. The stacked generalization method is adopted here.
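A minimal sketch of the global-data-scope idea follows: every component classifier scores every instance and the stacked prediction vectors form the derived dataset for the combiner. The placeholder scoring functions c1 to c3 are assumptions, not the component classifiers used in this work.

```python
# A minimal sketch of stacked generalization at the data level: every
# component classifier scores every instance, and the stacked prediction
# vectors form a new derived dataset for the combiner.

def c1(x): return x[0]                    # e.g. a scaled NN score in [0, 1]
def c2(x): return 1.0 - x[1]
def c3(x): return 0.5 * (x[0] + x[1])

COMPONENTS = [c1, c2, c3]

def derived_dataset(instances, labels):
    """Stack [C1(x), C2(x), ..., Ck(x)] for each instance x."""
    stacked = [[clf(x) for clf in COMPONENTS] for x in instances]
    return list(zip(stacked, labels))

if __name__ == "__main__":
    X = [(0.9, 0.1), (0.2, 0.8), (0.6, 0.4)]
    y = ["attack", "normal", "attack"]
    for prediction_vector, label in derived_dataset(X, y):
        print(prediction_vector, label)
```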
2.10 GP-Based Learning Algorithm:
Developing a composite classifier requires two phases, a training phase and a classification phase. Pseudocode for both phases is given below.
Notation:
S_t: the training data; S_tst: the testing data
C(x): the class of instance x
OCC: the optimal composite classifier
C_k: the kth component classifier
C_k(x): the prediction of C_k
Training pseudocode, Train-Composite-Classifier(S_t, OCC):
1. Present every input instance x ∈ S_t to the k component classifiers.
2. Collect [C1(x), C2(x), ..., Ck(x)] for each x ∈ S_t to form a prediction vector.
3. Combine the predictions using the GP method, taking T as the threshold to compute the AUCH of the ROC curve; each prediction is used as a terminal in the GP tree.
Classification pseudocode:
1. Apply the composite classifier to a data sample x taken from S_tst.
2. Stack the predictions X = [C1(x), C2(x), ..., Ck(x)] to form the new derived prediction vector.
3. Compute OCC(X).
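The following sketch shows one way the fitness computation described above could look: sweep the threshold T over [0, 1], collect (FPR, TPR) points, and score the individual by the AUCH of its ROC points. The toy scores, labels and number of threshold steps are assumptions for illustration, not values from this work.

```python
# A minimal sketch of the AUCH-based fitness measure: compute ROC points over
# a range of thresholds, then take the area under the convex hull of those
# points (plus the (0,0) and (1,1) corners).

def roc_points(scores, labels, steps=20):
    """labels: 1 = attack (positive), 0 = normal. Higher score = more attack-like."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = []
    for i in range(steps + 1):
        t = i / steps
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / neg, tp / pos))
    return points

def auch(points):
    """Area under the convex hull of the ROC points."""
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])
    hull = []
    for p in pts:  # build the upper hull from left to right
        while len(hull) >= 2 and cross(hull[-2], hull[-1], p) >= 0:
            hull.pop()
        hull.append(p)
    # trapezoidal area under the hull segments
    return sum((x2 - x1) * (y1 + y2) / 2.0 for (x1, y1), (x2, y2) in zip(hull, hull[1:]))

if __name__ == "__main__":
    scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]   # composite classifier outputs in [0, 1]
    labels = [1,   1,   0,   1,   0,   0]     # toy ground truth
    print(auch(roc_points(scores, labels)))
```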
CHAPTER 3
LITERATURE SURVEY:
(Michal Wozniak, Manuel Grana and Emilio Corchado, 29 April 2013). The paper presents a survey of up-to-date multiple classifier systems (MCS) viewed as hybrid intelligent systems. The major issues discussed are diversity and the methods for decision fusion. The system topologies used to design an MCS are:
1. Parallel topologies
2. Serial topologies based on the AdaBoost algorithm
The paper addresses two design issues, ensemble design and fuser design. In ensemble design, mutually complementary individual classifiers are included on the basis of high diversity and accuracy. The fuser design is depicted in Fig 3.1.
Fig 3.1 Architecture of an MCS making its decision on the basis of class-label fusion only (the outputs of classifiers 1 to N are merged by a combination rule that produces the final decision).
Fig 3.2 Architecture of an MCS which computes the decision on the basis of support-function combination (the supports for each class from every classifier are aggregated and the decision is made according to the combined supports).
The MCS is not the only option for hybridization. Other possibilities are:
1. Merging raw data collected from different sources and stored in one repository for classifier training.
2. Merging prior expert knowledge with the raw data from different sources.
3. Merging prior knowledge with models produced by machine learning procedures.
The main points to consider when designing such systems are data privacy and computation and memory efficiency.
(International Journal of Pattern Recognition and Artificial Intelligence). The paper proposes an SVM ensemble based on the Choquet integral. The aim of this ensemble model is to predict financial distress using a bagging algorithm; the proposed ensemble can be expressed as "Choquet + SVMs + Bagging". The Choquet integral gives higher average accuracy and stability than a single SVM classifier.
(Durga Prasad Muni, Nikhil R. Pal, Senior Member IEEE, and Jyotirmoy Das). A new approach for designing classifiers using genetic programming was proposed. The paper takes an integrated view of all classes when the genetic programming run starts. In the paper, modified
mutation and crossover operations were proposed to reduce the destructive nature of the genetic operations. A new concept of the unfitness of a tree is used to select a tree for a genetic operation; the intention is to give unfit trees a chance to become fit. At the terminal nodes a new OR-ing operation is introduced, which yields a classifier with better performance. For conflict resolution, heuristic rules that characterize ambiguous situations and a weight-based scheme are used. The resulting classifier is able to say "I don't know" when it faces situations outside its knowledge domain. The effectiveness of the approach is demonstrated on several real datasets. A single run of GP is used to design a classifier for pattern classification. A typical genetic programming implementation involves the following steps:
1. Start with a randomly generated population of solutions of size N.
2. Assign a fitness value to each solution in the population.
3. Probabilistically select a genetic operator.
The datasets used in the paper are:
A. Iris dataset
B. Wisconsin Breast Cancer
C. BUPA Liver Disorders
D. Vehicle data
E. RS data
The GP approach proposed for designing a classifier requires only a single GP run to evolve the optimal classifier. The contributions of the paper are:
1. An approach for designing classifiers for multi-category problems using the concept of a multi-tree in genetic programming.
2. Selection of trees for crossover or mutation based on unfitness.
3. A modified crossover operation.
4. A non-destructive directed point mutation operator, known as the modified or new mutation operator.
5. An OR-ing operation to optimize the classifier.
6. A weight-based scheme for conflict resolution.
7. A modified version of the heuristic rule of Kishore et al.
The GP-based classifier was tested with various datasets and the results obtained were satisfactory. Limitations:
1. Size of the trees
2. Simultaneous feature analysis with classifier design
(Niusvel Acosta-Mendoza, Alicia Morales-Reyes, Hugo Jair Escalante and Andres Gago-Alonso, 2014). The paper proposes a novel approach using genetic programming to build heterogeneous ensembles. Ensemble learning aims at combining the outputs of various individual classifiers to improve performance, typically via majority voting or a weighted sum. The paper uses a GP-based approach to learn the fusion functions that combine the ensemble classifiers' outputs. The main focus is on ensembles of heterogeneous classifiers, where the individual classifiers are based on different principles. The results show that the proposed method is very successful at building highly competent models, and the method can also be used for combining homogeneous classifiers.
(K.M. Faraoun, A. Boukelif). The paper presents a new genetic programming approach for classification. The method genetically co-evolves a population of non-linear transformations of the data to be classified and maps the data to a new, dimensionally reduced subspace with higher inter-class discrimination, so that new examples of the transformed data are easier to classify. The method uses dynamic repartition of the transformed data into distinct intervals, and efficiency is achieved through a fitness criterion that rewards higher class discrimination. The method is benchmarked on two datasets, the Fisher Iris and the MIT KDD CUP 99 datasets: Fisher Iris is used for comparison and to illustrate the method's capabilities, and MIT KDD CUP 99 is used for intrusion detection. The reported performance rates are:
DR = 0.980 (98%)
FP = 7E-4 (0.07%)
Classification rate = 99.05%
The technique is independent of the dataset and of the GP structure employed.
(Gianluigi Folino, Giandomenico Spezzano and Clara Pizzuti). The paper presents a genetic programming approach for classifying data and inducing an ensemble of predictors. Individual classifiers are trained on different subsets of the overall data, and then a majority voting
algorithm, e.g. bagging, is used to combine the classifiers. The results of this approach on a large dataset showed that including different classifiers trained on samples of the data achieves higher accuracy than a single classifier, while the single classifier has a higher computational cost than the ensemble.
1. The main feature of the proposed model is that each sub-population generates a classifier that works on a sample of the training data instead of using the whole training set.
2. The approach, CGPC, is able to cope with large datasets that do not fit in main memory.
3. Experiments on large real data showed that high accuracy can be obtained with a reasonably sized data sample at lower computational cost.
(Giandomenico Spezzano, Gianluigi Folino and Clara Pizzuti). An intrusion detection system based on GP is proposed. The GP algorithm is designed for distributed networks, to monitor security-related activities that occur inside a network. Each network node contains a cellular program based on genetic programming whose main aim is to produce a decision-tree predictor. The program is trained on the local data stored at the node; the cellular genetic programs work cooperatively but independently, and the approach takes advantage of a model for exchanging the outermost individuals, which helps in computing the classifiers. Once the classifiers are computed, they are combined to form the GP-based ensemble. The dataset used is KDD CUP 99. The confusion matrix of the proposed system is:
Table 3.1 Confusion matrix
          Normal    Probe    DoS       U2R    R2L
Normal    60250.8   200.2    110.8     15.4   15.8
Probe     832.8     2998.4   263.6     26.4   44.8
DoS       7464.2    465.0    221874.8  19.2   29.8
U2R       139.6     45.2     17.2      11.8   14.2
R2L       15151.4   48.6     232.4     173.8  582.8
The paper explains that an IDS can be viewed as a set of functional entities encapsulated to form an autonomous agent, and that genetic programming can serve as a learning paradigm for training agents that can potentially detect intrusive actions.
(Urvesh Bhowan, Mark Johnston, Member IEEE, Mengjie Zhang, Senior Member IEEE, Xin Yao, Fellow IEEE). The paper addresses the performance bias suffered by machine learning algorithms on unbalanced datasets. A dataset is unbalanced when one class is represented
by a large number of training examples, called the majority class, while the other, the minority class, contains only a small number of training examples. In an unbalanced dataset scenario the majority class is classified with good accuracy compared to the minority class. To address this problem, the paper proposes a multi-objective genetic programming approach that evolves accurate and diverse ensembles of genetic program classifiers which perform well on both classes. The paper evaluates the effectiveness of two popular Pareto-based fitness strategies (SPEA2 and NSGA-II) and investigates methods for encouraging diversity among the solutions in the evolved ensemble. The results show that the ensembles outperform their individual members.
(Detecting New Forms of Intrusion Using Genetic Programming). The paper presents a rule-evolution approach based on genetic programming for detecting novel attacks on networks, together with four genetic operators: reproduction, mutation, crossover and dropping condition. The evolved rules are used to detect novel or known attacks. The DARPA training and testing dataset is used to evolve the new rules. The rules generated by genetic programming have a low false positive rate, a low false negative rate and a high rate of detecting unknown attacks, i.e. a high detection rate with a low false alarm rate.
(A Genetic Algorithm Based Network Intrusion Detection System). A machine learning algorithm was used to identify the type of connection, i.e. attack or normal. The GA takes into consideration various features of the connection, such as the protocol type, the network service, the status of the connection at the destination and the connection state, to produce a classification rule set. KDD CUP 1999 was used to generate a rule set that can be applied to identify different classes of network attack connections. A rule set was developed that covers six different types of attack, falling into two classes, DoS and Probing. The generated rules have 100% accuracy for detecting DoS attacks and appreciable accuracy for detecting probe-type attacks, and the experimental results are encouraging.
(Intrusion Detection Using Error-Correcting Output Code Based Ensembles). The paper tackles the problems of class imbalance, raising detection rates for every class and minimizing false alarms in intrusion detection. An experiment is presented on seven classifiers using bagging and AdaBoost ensemble methods, and a new hybrid ensemble based on the Error-Correcting Output Code (ECOC) approach is designed; this approach is based on multiclass-to-binary classification methods. The seven classifiers used
for the experimental investigation are: Naive Bayes, Multi-Layer Perceptron, Support Vector Machine, Radial Basis Function Neural Network, J48, Random Tree and Random Forest. The new approach improves accuracy (99.7%), increases detection rates and reduces false alarms, even for the minority classes.
(Rohan D. Kulkarni). KDD CUP 1999 is used for intrusion detection and pattern matching. Many experiments have been carried out on this dataset and many studies have analysed it. The paper reports results obtained by classifying 10% of the KDD CUP dataset using ensemble methods such as bagging, boosting and AdaBoost, and compares their performance with the standard J48 algorithm.
(Roshni Dubey and Pradeep Nandan Pathak). This journal paper describes a hybrid design for intrusion detection that merges anomaly detection with misuse detection. The proposed method includes an ensemble feature-selecting classifier and a data mining classifier. The former consists of four classifiers using different sets of features, each employing a machine learning algorithm called the fuzzy belief k-NN classification algorithm. The latter uses data mining techniques to extract computer users' normal behaviour from network traffic data. The outputs of the feature-selecting classifier and the data mining classifier are then fused to obtain the final decision. The experiments presented in the paper show that the hybrid approach efficiently generates a more accurate intrusion detection model for detecting both normal and malicious activities.
(M.R. Moosavi, M. Zolghadri, 2012). The paper proposes a novel cost-sensitive learning algorithm to enhance the performance of the nearest neighbour classifier for intrusion detection. It aims to minimize the overall cost in a leave-one-out classification of the given training set, since intrusion detection is a problem in which the costs of different misclassifications are not equal. The distance function is defined in parametric form to optimize the nearest neighbour classifier for intrusion detection, and the proposed feature-weighting and instance-weighting algorithms are used to adjust the free parameters of the distance function. The feature-weighting algorithm can be viewed as a wrapper method for feature weighting, while the instance-weighting algorithm removes noisy and redundant training instances from the training set, which improves speed and generalization performance. The paper uses the KDD CUP 1999 dataset; with this dataset the scheme succeeds in reducing the average cost of classification on
previously unseen data. The scheme also removes redundant features and instances by setting their weights to zero.
(Mark Crosbie, Eugene H. Spafford). The paper presents a solution to problems that arise in intrusion detection in computer security. The model merges artificial life and computer security, and uses autonomous agents to implement the intrusion detection system. The system uses automatically defined functions for evolving genetic programs containing several data types and ensures type safety.
(Amit Kumar, Harish Chandra Maurya, Rahul Misra). In this paper an IDS consists of four components according to the CIDF framework: event generators, analysers, event databases and response units. A dataset is used to provide attack and normal data to the analyser. The best machine learning algorithm is used to improve the detection-rate alerts. The data centre both trains and evaluates the performance of the analyser and evolves its predictions.
(Niusvel Acosta-Mendoza and Hugo Jair Escalante, 2012). The paper gives a novel approach for constructing ensembles based on genetic programming. A GP-based approach is used to learn combination functions that merge the outputs of each classifier, and a systematic empirical evaluation is carried out to validate the effectiveness of the proposed approach.
(Vipin Das, Vijaya Pathak, Sattvik Sharma, 2014). In the paper, Rough Set Theory (RST) and a support vector machine are used to detect intrusions. RST is used to pre-process the captured data and reduce its dimensionality; the pre-processed data is then sent to the SVM model for learning and testing. This method reduces the spatial density of the data.
(Carlotta Domeniconi and Bojun Yan). In the paper, the instability of the KNN classifier with respect to different choices of features is exploited to generate diverse NN classifiers with uncorrelated errors. The approach exploits the high dimensionality of the data, and the results show performance improvements.
(Yan-Nei Law and Carlo Zaniolo). An incremental classification algorithm is proposed. The algorithm uses the concept of multi-resolution data representation and finds an adaptive nearest neighbourhood of a test point. The incremental algorithm achieves good performance using a small ensemble of classifiers, and the classifiers guarantee error bounds for each ensemble size. The classifier is highly suitable for data stream applications. Experiments on synthetic and real-life data indicate that the
proposed algorithm outperforms the existing ones in terms of accuracy and computational cost.
(Prof. Dighe Mohit S., Kharde Gayatri B., Mahadik Vrushali G., Gade Archana L., Bondre Namrata R., 2015). The main goal of this paper was to detect the class of attack and categorize it. The experiments show that the proposed approach detects attacks and groups them into 10 clusters with about 94% accuracy using a neural network with two hidden layers. A multi-layer perceptron and the Apriori algorithm were used in this research, and back-propagation was used to improve detection and to classify all kinds of attacks.
(Devaraju and S. Ramakrishnan). Multivariate statistical methods were used for anomaly detection. A Markov model is used for the implementation and for system-call based anomaly detection. Batch-sequencing and adaptive-sequencing change-point detection are used for attack detection in network traffic. An AdaBoost algorithm incorporating decision rules handles both categorical and continuous features; it focuses on four modules: feature extraction, data labelling, design of the weak classifiers, and construction of the strong classifier. The system works on the KDD CUP 1999 intrusion detection dataset. Conditional and layered approaches address the two issues of accuracy and efficiency.
(Upendra, Assistant Professor, CSE Department, NIT Raipur, C.G., India). The paper analyses two learning algorithms, NB and C4.5, for detecting intrusions and compares them. It shows that C4.5 performs better than NB, with the highest classification accuracy and the lowest error rate.
(Maninder Singh, Sanjeev Rao, 2015). In this survey paper a comparison of all the classifiers is carried out. The results show that not all data mining methods are satisfactory; the survey indicates that random forest provides more accurate results than the other classifiers.
(Ajith Abraham, Crina Grosan and Carlos Martin-Vide). In this paper an intrusion detection program is proposed for detecting attack patterns, acting as a defensive mechanism for the system. Three variants of genetic programming are used: linear genetic programming, multi-expression programming and gene expression programming. Several indices are used for comparison, followed by a systematic analysis of the MEP method. The empirical results show that genetic programming can play a major role in developing intrusion detection programs.
These intrusion detection programs are lightweight and more accurate than standard intrusion detection systems that use machine learning as their learning paradigm. The dataset used in this paper was prepared in 1998 by DARPA at MIT Lincoln Labs.
CHAPTER 4
4.1 METHODOLOGIES:
In this paper we adopt two approaches for enhancing the performance of the nearest neighbour classifier. The first approach develops a genetic programming based numeric expression classifier, ModNN, by slightly modifying the voting and selection methods of the KNN classifier. The second approach combines the classifiers through GP-based combination techniques.
4.2 Nearest Neighbour:
Nearest neighbour machine learning algorithms rely on the locality of the instances present in the input data. Newly encountered examples are classified based on the data already stored in the database: a new example is assigned the class of the closest stored examples as determined by the Euclidean distance, and the decision is determined by the closest k examples. An optimal mapping function f(x) is used to assign the correct class to a data example. For classification problems with only two classes, the data is classified into either C1 or C2. The output of the nearest neighbour classifier is evaluated with the ROC curve: the raw outputs are first obtained and then scaled into the range (0, 1).
4.3 Proposed work
4.3.1 Developing a Numeric Expression Classifier:
To develop ModNN, optimization techniques based on genetic programming will be used; optimized classifiers perform better than plain classifiers. Based on the distribution of the training examples, these numeric expression classifiers provide the mapping from feature space to class space. The modules used to develop the numeric expression classifier are described below.
Fig 4.1 Block diagram for developing an optimized numeric classifier (training and testing data feed a GP process that generates an initial population, evaluates fitness through the AUCH-computing and NEC testing modules during evolution, checks the termination criterion, and finally saves the optimal NEC classifier and reports its output performance).
4.3.2 GP Module:
This module is used to obtain an optimal solution. GP operators are applied to create a new generation from the selected individuals, and the process terminates once the module reaches the desired, optimal result. A tree is built to represent each candidate solution: the terminal nodes of the tree contain constant or variable values, while the non-terminal nodes are functions used to process the input values.
4.3.3 Fitness Computing Module:
An individual is picked from the GP population and tested over the given threshold range for performance. A prediction is made by the GP individual for every test example; for different thresholds the TPR and FPR are computed to plot the ROC curve, and the AUCH of the individual is then found. The output of this module is given as input to the GP module. One issue with the GP process is selecting suitable attribute values for a GP tree when computing its fitness. In our case we consider that training examples lying far away from the test example contribute little to the decision, so we use the median distance for
the selection of neighbours. The median distance gives four quartiles, and for our experiments we use the first two. In terms of quartiles, the projection of a test sample x ∈ S_tst in 2-D is shown in Fig 4.2.
Fig 4.2 Counting normal and attack connections in quartiles Q1 and Q2 of the Euclidean distance around the test sample.
Let Q1n and Q2n be the counts of normal connections in quartile 1 and quartile 2, and Q1a and Q2a the counts of attack connections in the corresponding quartiles. The attack count and normal count in each strip can carry different weights; larger weights are automatically given to the smaller strips through the GP process. Voting is based on the count of each class in the strip. For a test example, the prediction of an individual is obtained by supplying the values of Q1n and Q2n as inputs for computing the probable normal connection count (PN); similarly, the values of Q1a and Q2a are supplied for the probable attack connection count (PA). The attack connection probability is computed by dividing the difference between PA and PN by their sum. This probability is compared with the threshold: if the probability is greater than or equal to the threshold, the test example is predicted as an attack, otherwise it is predicted as normal.
Fig 4.3 Evaluation module for fitness in the GP process (for each test sample the probable normal count NC and probable attack count AC are computed, the attack probability (AC - NC)/(AC + NC) is compared with the threshold to predict attack or normal; once all test samples are predicted the TPR and FPR are computed, the threshold is varied from 0 to 1, and when the threshold reaches 1 the AUCH is computed).
4.4 Ensemble:
4.4.1 Combining KNN Classifiers:
To combine KNN-based classifiers we use a two-layered architecture. In the first layer of our architecture, m component classifiers are constructed; separate GP simulations are used for homogeneous and heterogeneous composite classifiers. The combined output prediction OCC is a function of [C1(T), C2(T), C3(T), ..., Cn(T)].
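A minimal sketch of this prediction rule is given below. Because the GP-evolved functions for the probable counts are not specified here, they are replaced by a simple weighted sum of the quartile counts; those weights, and the toy distances and labels, are assumptions made purely for illustration.

```python
import statistics

# A minimal sketch of the Fig 4.3 prediction rule: count normal and attack
# connections in the first two distance quartiles, derive probable counts PN
# and PA, and compare (PA - PN)/(PA + PN) with the threshold.

def quartile_counts(distances, labels, cls):
    """Count training points of class `cls` in the first and second distance quartiles."""
    q1, q2 = statistics.quantiles(distances, n=4)[0:2]   # 25th and 50th percentiles
    c_q1 = sum(1 for d, y in zip(distances, labels) if d <= q1 and y == cls)
    c_q2 = sum(1 for d, y in zip(distances, labels) if q1 < d <= q2 and y == cls)
    return c_q1, c_q2

def attack_probability(distances, labels, w1=2.0, w2=1.0):
    q1n, q2n = quartile_counts(distances, labels, "normal")
    q1a, q2a = quartile_counts(distances, labels, "attack")
    pn = w1 * q1n + w2 * q2n      # probable normal connection count (PN)
    pa = w1 * q1a + w2 * q2a      # probable attack connection count (PA)
    if pa + pn == 0:
        return 0.0
    return (pa - pn) / (pa + pn)

def predict(distances, labels, threshold):
    return "attack" if attack_probability(distances, labels) >= threshold else "normal"

if __name__ == "__main__":
    dists = [0.1, 0.2, 0.3, 0.5, 0.9, 1.2, 1.5, 2.0]   # distances to training points
    labs = ["attack", "attack", "normal", "attack",
            "normal", "normal", "normal", "normal"]
    print(predict(dists, labs, threshold=0.0))
```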
Fig 4.4 Block diagram for developing the optimal composite classifier (the component classifier outputs C1(x), C2(x), ..., Cm(x) are fed, together with the threshold T and other random variables, into the GP-based combining technique). The four main parts in the figure are:
 Input dataset
 Construction and selection of NN component classifiers
 Computation of the GP fitness function
 GP process to develop the OCC
4.4.2 KDD CUP 1999 DATASET:
The KDD CUP dataset is divided into three equal but non-overlapping sets, training data1, testing data1 and testing data2, using the holdout method.
4.4.3 Construction and selection of KNN component classifiers:
The first step in selecting KNN classifiers is to create a set of many high-performing, complementary nearest neighbour classifiers. The second consideration is that including several copies of the same KNN component classifier should not raise performance as much as including several different classifiers. Several KNN component classifiers are created for various choices of k using random selection and best random selection.
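The sketch below illustrates the holdout split into three equal, non-overlapping sets and the construction of a pool of KNN component classifiers for different values of k. The toy dataset stands in for the KDD CUP 1999 records, whose loading and preprocessing are omitted here.

```python
import random
from collections import Counter

# A minimal sketch of the data split and component-classifier construction
# described above. `dataset` is a toy stand-in for the KDD CUP 1999 records.

def holdout_split(dataset, seed=0):
    """Split into training data1, testing data1 and testing data2 (equal, disjoint)."""
    rows = dataset[:]
    random.Random(seed).shuffle(rows)
    third = len(rows) // 3
    return rows[:third], rows[third:2 * third], rows[2 * third:]

def make_knn(train, k):
    """Return a KNN component classifier for one choice of k."""
    def predict(x):
        neighbours = sorted(train, key=lambda p: sum((a - b) ** 2 for a, b in zip(p[0], x)))[:k]
        return Counter(label for _, label in neighbours).most_common(1)[0][0]
    return predict

if __name__ == "__main__":
    dataset = [((random.random(), random.random()), random.choice(["normal", "attack"]))
               for _ in range(30)]
    train1, test1, test2 = holdout_split(dataset)
    components = {k: make_knn(train1, k) for k in (1, 3, 5, 7)}  # candidate pool
    sample = test1[0][0]
    print({k: clf(sample) for k, clf in components.items()})
```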
4.4.4 GP Fitness module:
Data examples taken from testing data1 determine the fitness of each individual. The decision values of the individuals are obtained, and by varying the threshold T in the range [0, 1] the TPR and FPR values are computed. The ROC curve is obtained by plotting these values, and the AUCH of the ROC curve is then found; the individual with the higher AUCH has the higher performance. The GP simulation is halted when the fitness score exceeds 0.999 or the number of generations reaches the predetermined maximum.
What is Machine Learning?
Learning, like intelligence, covers such a broad range of processes that it is difficult to define precisely. A dictionary definition includes phrases such as "to gain knowledge, or understanding of, or skill in, by study, instruction, or experience" and "modification of a behavioural tendency by experience". Zoologists and psychologists study learning in animals and humans; here we focus on learning in machines. There are many parallels between animal and machine learning: many techniques in machine learning derive from the efforts of psychologists to make their theories of animal and human learning more precise through computational models, and it seems likely that the concepts and techniques explored by machine learning researchers may in turn illuminate certain aspects of biological learning. As regards mechanisms, we might say, very broadly, that a machine learns whenever it changes its structure, program, or data (based on its inputs or in response to external information) in such a manner that its expected future performance improves. Some of these changes, such as the addition of a record to a database, fall comfortably within the scope of other disciplines and are not necessarily better understood for being called learning. But when, for example, the performance of a speech-recognition machine improves after hearing several samples of a person's speech, we feel quite justified in saying that the machine has learned. Machine learning usually refers to changes in systems that perform tasks associated with artificial intelligence (AI), such as recognition, diagnosis, planning, robot control and prediction. The "changes" may be either enhancements to already functioning systems or the synthesis of new systems from scratch. To be slightly more specific, consider the design of a typical AI agent: the agent perceives and models its environment and computes appropriate actions, perhaps by anticipating their effects. Changes made to any of the components shown in the
figure might count as learning, and different learning mechanisms might be employed depending on which subsystem is being changed. One might ask, "Why should machines have to learn? Why not design machines to perform as desired in the first place?" There are several reasons why machine learning is important. We have already mentioned that the achievement of learning in machines might help us understand how animals and humans learn, but there are important engineering reasons as well. Some of these are:
 Some tasks cannot be defined well except by example: we may be able to specify input/output pairs but not a concise relationship between inputs and desired outputs. We would like machines to be able to adjust their internal structure to produce correct outputs for a large number of sample inputs, and thus suitably constrain their input/output function to approximate the relationship implicit in the examples.
 It is possible that important relationships and correlations are hidden among large piles of data. Machine learning methods can often be used to extract these relationships (data mining).
 Human designers often produce machines that do not work as well as desired in the environments in which they are used. In fact, certain characteristics of the working environment might not be completely known at design time. Machine learning methods can be used for on-the-job improvement of existing machine designs.
 The amount of knowledge available about certain tasks might be too large for explicit encoding by humans. Machines that learn this knowledge gradually might be able to capture more of it than humans would be willing to write down.
 Environments change over time. Machines that can adapt to a changing environment reduce the need for constant redesign.
4.4.5 Genetic Programming Module:
To produce the next population, three GP operators, namely reproduction, mutation and crossover, are used in the GP process. These operators help convergence to an optimal solution, and the optimal composite classifier is expected at the end of the GP process. GPs are heuristic search programs designed to simulate processes in natural systems. GP belongs to the larger class of evolutionary algorithms that produce solutions to optimization problems using methods inspired by natural evolution, such as inheritance,
mutation, selection and crossover. These are adaptive heuristic search algorithms founded on the evolutionary ideas of natural selection and genetics; the basic idea of these evolutionary algorithms is to simulate the processes in natural systems that are essential for evolution. GPs are used for numerical and computational optimization and for studying the evolutionary aspects of models of social systems; a GP approach can, for example, be used to optimize a set of indices derived from complex network theory. Genetic algorithms of this kind are search algorithms based on the mechanics of natural selection and natural genetics: they combine survival of the fittest among string structures with structured yet randomized information exchange to form a search algorithm with some of the innovative flair of human search. GP performs a balanced search across the population: diversity must be maintained so that important information is not lost, while there is also a strong need to focus on the fit portions of the population. Reproduction in GP is defined as the process of producing offspring. The first requirement of a GP is a set of solutions, represented by chromosomes, called the population. The solutions taken from one population are used to form a new population, with the expectation that the new population will be better than the old one. The best solutions are selected to form new offspring; they are selected on the basis of their fitness, i.e. the most suitable individuals get more chances to reproduce. GPs are used for search, optimization and machine learning. They are a very popular optimization method, often successful in real applications and of interest to those working on meta-heuristics. Evolutionary algorithms are used to solve problems that do not already have a well-defined efficient solution, and genetic programming has been widely used to solve optimization problems.
4.4.5.1 Basic genetic operators
 Selection
 Crossover
 Mutation
Population diversity plays a major role in the performance of GP. It is widely agreed among GP developers that the higher the diversity in the population, the lower the chance of premature convergence and therefore the higher the chance of escaping from a local optimum. Different crossover strategies have been proposed in the literature to create diversity in the
population. Goldberg proposed the Partially-Mapped Crossover operator (PMX), in which a segment of one parent's chromosome is mapped onto a segment of the other parent's chromosome and the remaining genes are exchanged. Another crossover operator is the Cycle Crossover operator (CX), which creates the offspring by copying the value of a gene together with its position from the parents into the offspring, taking the feasibility of the chromosome into account. Frequency Crossover (FC), together with nine different kinds of mutation, has been proposed to solve the TSP: the FC is used to stabilize the population while the nine kinds of mutation increase population diversity to prevent premature convergence. GPs suffer from the difficulty of convergence to local optima; this occurs when an outstanding individual takes over a significant proportion of the finite population and leads it towards undesirable convergence. There are various methods to avoid premature convergence, such as restricted mating, incest prevention, crowding, introducing a random offspring in each generation, adaptive mutation rates, moderating crossover greediness and the impact of random factors, the social catastrophe technique, niching, and dynamic genetic clustering algorithms (DGCA).
4.4.6 Classic Genetic Programming
Step 1. Creating the initial population
Initially, several individual solutions are randomly generated to form an initial population. Very commonly the population is generated arbitrarily, covering the entire range of attainable solutions. Alternatively, the solutions may be "seeded" in areas where optimal solutions are likely to be found, for instance when an existing solution to an engineering design problem is to be improved.
Step 2. Evaluation and ranking
In this step, the fitness of each individual solution is computed; based on these fitness values, each individual is assigned a rank and the population is sorted according to the rankings.
Step 3. Selection operation
If the probability filter is satisfied, an individual is selected and passed unchanged to the next generation. There are many selection strategies, although in the prevailing ones individual solutions are selected through a fitness-based method whereby fitter solutions are more likely to be chosen.
Step 4. Crossover operation
The crossover probability is set as a parameter at the start of the program. If the probability filter is satisfied, two individuals are randomly identified as parents, and one or two offspring (depending on the variant of the algorithm) are then created from this pair of parents. In replacing the parents, the offspring must at least be feasible; this may involve trying different crossover parameters to achieve feasibility.
Step 5. Mutation operation
The mutation probability is also set as a parameter. If the probability filter is satisfied, an individual is selected and subjected to mutation. The goal of mutation in GP is to allow the algorithm to avoid local minima by preventing the population of candidate solutions from being dominated by a few strong candidates, which would slow down or even halt progress.
Step 6. Termination test
If a termination condition is reached, the generational process is terminated; otherwise steps two to five are repeated. Common terminating conditions include: a satisfying solution being found, a fixed number of generations being reached, or the algorithm having converged to an optimum so that successive iterations no longer produce better results.
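A compact sketch of this generational loop is given below. To keep it short, the individuals are fixed-length bit strings rather than program trees and the fitness function is a stand-in (count of ones); only the control flow of the six steps is meant to match the description above.

```python
import random

# A compact sketch of the generational loop: initialise, evaluate and rank,
# select, cross over, mutate, and test for termination. Parameters and the
# toy fitness function are illustrative assumptions only.

GENES, POP, P_CROSS, P_MUT, MAX_GEN = 20, 30, 0.9, 0.05, 100

def fitness(ind):
    return sum(ind)                      # stand-in objective: maximise the ones

def select(population):
    """Fitness-proportional (roulette) selection."""
    total = sum(fitness(i) for i in population) or 1
    r, acc = random.uniform(0, total), 0
    for ind in population:
        acc += fitness(ind)
        if acc >= r:
            return ind
    return population[-1]

def crossover(a, b):
    point = random.randint(1, GENES - 1)
    return a[:point] + b[point:], b[:point] + a[point:]

def mutate(ind):
    return [1 - g if random.random() < P_MUT else g for g in ind]

population = [[random.randint(0, 1) for _ in range(GENES)] for _ in range(POP)]
for gen in range(MAX_GEN):
    population.sort(key=fitness, reverse=True)      # evaluation and ranking
    if fitness(population[0]) == GENES:             # termination test
        break
    next_pop = [population[0]]                      # reproduction of the best individual
    while len(next_pop) < POP:
        p1, p2 = select(population), select(population)
        c1, c2 = crossover(p1, p2) if random.random() < P_CROSS else (p1, p2)
        next_pop += [mutate(c1), mutate(c2)]
    population = next_pop[:POP]
print(gen, fitness(max(population, key=fitness)))
```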
Fig 4.5 Flowchart of genetic programming (parameter setting, population initialization, evaluation and ranking, selection, crossover and mutation, creation of the new population, and the termination test before outputting the result).
4.4.7 GP: advantages and disadvantages
Advantages:
 GP is, by nature, a parallel search: a variety of candidate solutions are considered simultaneously, so that a global optimum is more likely to be found.
 Compared to gradient-based strategies, GPs impose fewer mathematical requirements (such as differentiability of the objective and constraint functions, or continuity of the variables) on the optimization problem, so they can handle all kinds of objective functions and constraints defined over discrete, continuous or mixed search spaces, and require only simple computations in each iteration.
Disadvantages:
 GPs are less efficient than gradient-based algorithms when solving optimization problems with purely continuous variables, as indicated by the fact that many more iterations are needed for convergence.
 Compared to gradient-based and directional strategies, many function evaluations are required in each iteration of the GP, so it costs far more computer time per iteration.
One focus of this study is to develop a generic GP problem solver for engineering optimization problems, which typically have continuous, integer and discrete variables or a mixture of these. This sort of problem is usually expressed as a mixed-discrete problem. To be as general as possible, the GP solvers considered in this study address mixed-discrete optimization. The common shortcoming of existing GP solvers is that they lack the flexibility to handle mixed-discrete problems. One attempt to address this issue was made by Deb, who proposed a GP that handles mixed variables by employing a mixed-discrete coding scheme together with a mixed-discrete crossover and a mixed-discrete mutation operator. One problem with Deb's methodology is that these operators need to be reprogrammed to suit each different design problem; that is, the underlying coding of Deb's approach is problem-dependent. Such re-programming is expensive and time-consuming, thereby limiting its wider application.
4.4.8 K-Nearest Neighbours classifier
A fundamental issue in data mining when addressing the classification problem is the learning of classifiers. A dataset is constructed that contains a set of training instances and their corresponding labels; the classifier is trained on this dataset and used to predict the class of any unseen instance encountered. An instance x is defined by a vector < A1(x), A2(x), ..., An(x) >, where Ai(x) denotes the ith attribute value; the class variable and its values are denoted by C and c, and the class of instance x by c(x). K-nearest-neighbours classifiers have been widely used in classification problems. The classifier depends entirely on distance: a distance function determines the difference or similarity between two instances, and the standard Euclidean distance is generally used. The distance between two instances is defined as
$$d(x, y) = \sqrt{\sum_{i=1}^{n} \big(a_i(x) - a_i(y)\big)^2}$$
For any test instance x, the classifier measures the distance to the training instances and assigns x to the class that is most common among its k nearest neighbours, as shown in the equation below. The KNN classifier is a typical example of a lazy learning algorithm: it simply stores the training data at training time and does all its work at classification time. In contrast, eager learning generates an explicit model at training time.
$$c(x) = \arg\max_{c \in C} \sum_{i=1}^{k} \delta\big(c, c(y_i)\big)$$
In the above equation y1, y2, …, yk are the k nearest neighbours of the instance x, k is the number of neighbours, and δ(c, c(yi)) = 1 if c = c(yi) and δ(c, c(yi)) = 0 otherwise. The KNN classifier has been widely used for decades because of its simplicity, effectiveness and robustness. Nevertheless it has several shortcomings, the three main ones being:
1) the Euclidean distance is used as the distance function for measuring difference or similarity;
2) the neighbourhood size k is an artificially assigned input parameter;
3) a simple voting algorithm is used for class probability estimation.
To overcome these shortcomings, three main approaches are used:
1) use more accurate distance functions in place of the standard Euclidean distance;
2) replace the artificial input parameter k by searching for the best neighbourhood size;
3) find more accurate class probability estimation methods to replace simple voting.
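As an illustration of the distance-and-voting rule above, the following is a minimal sketch using scikit-learn's KNeighborsClassifier. It is illustrative only: the thesis experiments were carried out in Matlab, and the feature matrix X and labels y used here are hypothetical toy data.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy data: rows are instances <A1(x), ..., An(x)>, y holds class labels c(x)
X = np.array([[0.0, 1.2], [0.1, 1.0], [3.0, 3.1], [2.9, 3.3], [3.2, 2.8]])
y = np.array(["normal", "normal", "attack", "attack", "attack"])

# Euclidean distance (the default metric) and simple majority voting over k neighbours
knn = KNeighborsClassifier(n_neighbors=3, metric="euclidean")
knn.fit(X, y)

x_test = np.array([[3.0, 3.0]])
print(knn.predict(x_test))      # majority class among the 3 nearest neighbours
print(knn.kneighbors(x_test))   # distances and indices of those neighbours
```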
4.4.9 Principal component analysis
Principal component analysis (PCA) is one of the most valuable results of applied linear algebra. PCA is used abundantly in all forms of analysis, from neuroscience to computer graphics, because it is a simple, non-parametric method for extracting relevant information from confusing data sets. With minimal additional effort, PCA provides a roadmap for reducing a complex data set to a lower dimension in order to reveal the sometimes hidden, simplified dynamics that often underlie it.
The main aim of this section is to develop both an intuitive feel for PCA and a methodical explanation of the topic. We will start with a simple example and give an intuitive explanation of the goal of PCA; we will then place it within the framework of linear algebra, adding mathematical rigor so that the problem is explicitly defined, and discuss how and why PCA is intimately related to the mathematical technique of singular value decomposition (SVD). This leads to a prescription for how to make the best use of PCA in the real world. We also discuss the assumptions behind the method as well as possible extensions to remove its limitations.
4.4.9.1 An Example of PCA
Assumption: Suppose we are experimenters trying to understand some phenomenon by measuring various quantities (e.g. spectra, voltages, velocities, etc.) in our system. Unfortunately, we may not be able to figure out what is happening, because the data set appears unclear, clouded and even redundant. This is not a trivial problem, but rather a fundamental obstacle to empirical work. Examples abound in complex systems such as neuroscience, photo science, meteorology and oceanography: the number of variables to measure can be unwieldy, even though the underlying dynamics may often be quite simple.
To make this concrete, we take the simple toy example from physics diagrammed in Figure 4.6. Suppose we are studying the motion of the physicist's ideal spring. The experiment consists of a ball of mass m attached to a massless, frictionless spring. The ball is released a small distance away from equilibrium (i.e. the spring is stretched). Because the spring is "ideal", it oscillates indefinitely about its equilibrium position along the x-axis at a fixed frequency. The standard problem in physics is that the motion along the x direction is determined as an explicit function of time; in other words, the underlying dynamics can be expressed in terms of a single variable x.
Figure 4.6: Diagram of the toy example.
However, as naive experimenters we do not know any of this. We do not know which, let alone how many, axes and dimensions are important to measure. Thus we decide to measure the ball's position in three-dimensional space: we place three video cameras around our system of interest, and each camera records an image at 200 Hz giving a two-dimensional position of the ball. Unfortunately, because of our ignorance we do not even know what the real "x", "y" and "z" axes are, so we choose three camera axes {a, b, c} at some arbitrary angles with respect to the system; the angles between our measurement axes might not even be 90°! We record with the cameras for 2 minutes. The big question remains: how do we get from this data set to a simple equation of x?
The Goal: Principal component analysis computes the most meaningful basis with which to re-express a noisy, garbled data set. The hope is that this new basis will filter out the noise and reveal the hidden dynamics. In the example, the goal of PCA is to determine that "the dynamics are along the x-axis," i.e. that x̂, the unit basis vector along the x-axis, is the important dimension. Determining this fact allows an experimenter to discern which dynamics are important and which are redundant.
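The recipe just described, re-expressing the camera measurements in a basis that captures most of the variance, can be sketched as follows. This is a minimal illustration only, assuming synthetic spring-like data and NumPy; it is not the thesis' Matlab implementation.

```python
import numpy as np

# Synthetic "three-camera" data: the true dynamics are 1-D (a cosine along x),
# observed through three arbitrarily oriented, noisy measurement axes.
rng = np.random.default_rng(0)
t = np.linspace(0, 10, 500)
x = np.cos(2 * np.pi * t)                        # true 1-D dynamics
mixing = rng.normal(size=(3, 1))                 # arbitrary camera axes
X = (mixing @ x[None, :]).T + 0.05 * rng.normal(size=(500, 3))  # 500 samples x 3 measurements

# PCA via SVD of the mean-centred data matrix
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = S**2 / np.sum(S**2)
print("fraction of variance per principal component:", np.round(explained, 3))

# Projecting onto the first principal component recovers the hidden 1-D dynamics
pc1 = Xc @ Vt[0]
```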
4.4.10 KDD CUP 99 DATA SET DESCRIPTIONS
KDD'99 is the most widely used data set for evaluating intrusion detection methods. The data set was built on the data collected in the DARPA'98 IDS evaluation program. DARPA'98 contains about 4 gigabytes of compressed binary data covering 7 weeks of network traffic, which was processed into about 5 million connection records, each containing about 100 bytes of traffic data. The two weeks of test data yielded around 2 million connection records. The KDD training dataset consists of approximately 4,900,000 single connection vectors, each of which contains 41 features and is labelled as either normal or an attack, with exactly one specific attack type. The attacks fall into the following four groups:
1) DoS (Denial of Service): the attacker makes some computing or memory resource too busy or too full to handle legitimate requests, or denies legitimate users access to a machine.
2) U2R (User to Root): the attacker exploits some vulnerability to become the administrator (root) of the victim machine, for example through password sniffing, a dictionary attack, or social engineering.
3) R2L (Remote to Local): occurs when an attacker who can send packets to a machine over a network, but who does not have an account on that machine, exploits some vulnerability to gain local access as a user of that machine.
4) Probing Attack: the attacker gathers information about a network of machines, and this information is then used to compromise its security controls.
The KDD'99 CUP dataset features can be categorized into three main categories:
1) Basic features: this group comprises the attributes extracted from a TCP/IP connection. Many of these features lead to a delay in detection.
2) Traffic features: this group comprises features computed over a window interval and is divided into two categories:
a) "same host" features: examine the connections established in the past 2 seconds that have the same destination host as the current connection, and calculate statistics related to protocol behaviour, service, etc.
b) "same service" features: examine the connections established in the past 2 seconds that have the same service as the current connection.
The two kinds of "traffic" features above are time-based. However, there are several slow probing attacks that scan ports using a much larger time interval than 2 seconds, e.g. one probe per minute; such attacks do not produce intrusion patterns within a time window of 2 seconds. To solve this problem, the "same host" and "same service" features are also re-calculated over a connection window of the last 100 connections instead of a time window of 2 seconds. These are termed connection-based traffic features.
3) Content features: unlike most Probing and DoS attacks, R2L and U2R attacks do not exhibit frequent sequential patterns. DoS and Probing attacks involve many connections to some host(s) in a very short span of time, whereas R2L and U2R attacks are embedded in the data portions of the TCP/IP packets and normally involve only a single connection. To detect these kinds of attacks, we need features that look for suspicious behaviour in the data portion, e.g. the number of failed login attempts.
4.4.11 Experimental Setup:
A. Input database: KDD CUP 99
B. PCA-based feature extraction
C. KNN classifiers
D. Matlab
E. Genetic Programming toolkit
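The setup above, KDD records followed by PCA-based feature extraction and KNN classification, can be sketched in Python as follows. This is illustrative only: the actual experiments used Matlab and a GP toolkit, and the file name, column layout and parameter values here are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical: a preprocessed KDD CUP 99 subset with numeric features and a 'label' column
data = pd.read_csv("kddcup99_subset.csv")
X = data.drop(columns=["label"]).to_numpy(dtype=float)
y = data["label"].to_numpy()

# PCA-based feature extraction: keep enough components to explain 95% of the variance
X_scaled = StandardScaler().fit_transform(X)
X_pca = PCA(n_components=0.95).fit_transform(X_scaled)

# A KNN component classifier, evaluated with k-fold cross-validation
knn = KNeighborsClassifier(n_neighbors=33)   # an example neighbourhood size, not a prescribed value
scores = cross_val_score(knn, X_pca, y, cv=10)
print("mean cross-validated accuracy:", scores.mean())
```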
4.4.12 Terms:
ROC: receiver operating characteristic, a graphical plot that shows the performance of a binary classifier as its discrimination threshold varies. It is obtained by plotting the true positive rate against the false positive rate as the threshold varies.
True Positive Rate (sensitivity): $TPR = \frac{TP}{P} = \frac{TP}{TP+FN}$
True Negative Rate (specificity): $TNR = \frac{TN}{N} = \frac{TN}{FP+TN}$
Positive Predictive Value (precision): $PPV = \frac{TP}{TP+FP}$
Negative Predictive Value: $NPV = \frac{TN}{TN+FN}$
False Positive Rate: $FPR = \frac{FP}{N} = \frac{FP}{FP+TN} = 1 - TNR$
False Negative Rate: $FNR = \frac{FN}{P} = \frac{FN}{FN+TP}$
Accuracy: $ACC = \frac{TP+TN}{P+N}$
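These quantities follow directly from the entries of a confusion matrix. A minimal sketch, assuming hypothetical label vectors y_true and y_pred with "attack" treated as the positive class:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array(["attack", "attack", "normal", "normal", "attack", "normal"])
y_pred = np.array(["attack", "normal", "normal", "attack", "attack", "normal"])

# Rows: actual class, columns: predicted class; "attack" is the positive class
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=["normal", "attack"]).ravel()

tpr = tp / (tp + fn)      # sensitivity / recall
tnr = tn / (fp + tn)      # specificity
ppv = tp / (tp + fp)      # precision
fpr = fp / (fp + tn)      # 1 - specificity
acc = (tp + tn) / (tp + tn + fp + fn)
print(tpr, tnr, ppv, fpr, acc)
```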
CHAPTER 5
Results and Discussion
5.1 GP based Optimization
1. The algorithm begins by creating a random initial population.
2. The algorithm then creates a sequence of new populations. At each step, the algorithm uses the individuals in the current generation to create the next population. To create the new population, the algorithm performs the following steps:
a. Scores each member of the current population by computing its fitness value.
b. Scales the raw fitness scores to convert them into a more usable range of values.
c. Selects members, called parents, based on their fitness.
d. Some of the individuals in the current population that have the best fitness values are chosen as elite. These elite individuals are passed directly to the next population.
e. Produces children from the parents. Children are produced either by making random changes to a single parent (mutation) or by combining the vector entries of a pair of parents (crossover).
f. Replaces the current population with the children to form the next generation.
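A compact sketch of steps (a)-(f) follows. It is an illustrative pseudo-implementation in Python, not the Matlab GA/GP toolkit actually used; the toy fitness function, rank-based scaling, tournament selection and blend crossover are assumptions chosen only to make the generational step concrete.

```python
import random

def fitness(ind):
    """Toy fitness: higher is better (stands in for, e.g., cross-validated classifier accuracy)."""
    return -sum(x * x for x in ind)

def next_generation(population, elite_count=2, tournament_size=3, mutation_prob=0.1):
    # (a) score each member, (b) use rank ordering as the scaled fitness
    ranked = sorted(population, key=fitness, reverse=True)
    # (d) elitism: the best individuals pass straight to the next population
    new_population = [list(ind) for ind in ranked[:elite_count]]
    # (c) tournament selection: the best-ranked of a random sample wins
    def select():
        return ranked[min(random.sample(range(len(ranked)), tournament_size))]
    # (e) children by blend crossover plus occasional Gaussian mutation
    while len(new_population) < len(population):
        p1, p2 = select(), select()
        child = [(a + b) / 2.0 for a, b in zip(p1, p2)]
        if random.random() < mutation_prob:
            child = [x + random.gauss(0.0, 0.1) for x in child]
        new_population.append(child)
    # (f) the children (plus elites) replace the current population
    return new_population

population = [[random.uniform(-1, 1) for _ in range(4)] for _ in range(30)]
for _ in range(50):
    population = next_generation(population)
print("best fitness after 50 generations:", fitness(max(population, key=fitness)))
```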
Fig 5.1: Current best individual of classifier optimization using Genetic Programming
Fig 5.2: Number of children for selection per individual
 Generations: the algorithm stops when the number of generations reaches the value of Generations.
 Time limit: the algorithm stops after running for an amount of time in seconds equal to Time limit.
 Fitness limit: the algorithm stops when the value of the fitness function for the best point in the current population is less than or equal to Fitness limit.
 Stall generations: the algorithm stops when the average relative change in the fitness function value over Stall generations is less than Function tolerance.
 Stall time limit: the algorithm stops if there is no improvement in the objective function during an interval of time in seconds equal to Stall time limit.
 Stall test: the stall condition is either average change or geometric weighted. For geometric weighted, the weighting function is 1/2^n, where n is the number of generations prior to the current one; both stall conditions apply to the relative change in the fitness function over Stall generations.
Fig 5.3: Stopping criteria
Fig 5.4: As generations grow, the average distance between individuals drops significantly
Table 5.1: Optimization table used for improving accuracy of the NN classifier using GP
records folds K Model Name Accuracy (%) Time(mins) 1000 10 50 3 91.63194597 20.02204587 2000 9 33 2 95.56515682 24.58807926 3000 9 44 1 92.84629231 20.33630915 4000 6 32 2 97.8597933 29.40310606
  • 43. 43 5000 1 49 3 98.42251766 18.75813973 6000 8 49 1 91.22030836 17.1748795 7000 6 50 2 96.65303874 15.18687511 8000 10 45 1 98.18732645 38.65139885 9000 3 14 3 96.26225582 26.98995545 10000 6 21 3 91.39581563 40.5710283 11000 2 46 1 96.55104887 30.96600191 12000 7 41 1 98.45506164 29.24735065 13000 4 34 3 91.11947658 39.82860323 14000 9 14 1 98.23991181 27.07765661 15000 5 39 1 99.08721672 29.2388353 16000 4 11 3 96.94348921 37.3515081 17000 2 35 1 98.03364981 18.86148666 18000 3 31 1 95.04835363 20.01646052 19000 9 37 1 91.91128127 28.84630711 20000 3 16 1 97.78661687 32.06463708 21000 10 48 2 91.4506003 27.44625745 22000 5 33 2 93.55287676 21.8182112 23000 8 23 2 95.88219766 15.99263929 24000 4 28 2 94.10930793 26.05294057 25000 7 35 1 92.14031026 39.84650131 26000 1 11 3 98.14608682 34.30178114 27000 9 39 3 92.52787595 22.83691867 28000 6 38 3 91.54489232 31.22483872 29000 9 31 3 96.85810518 30.88720437 30000 6 14 1 92.60184764 39.34734564 31000 9 17 3 95.78801676 32.97699579 32000 5 36 3 92.57898704 22.20743218 33000 9 48 1 98.08208141 27.71891095 34000 5 24 2 98.3403275 24.44279958 35000 1 28 3 91.46325454 16.68770649 36000 10 12 2 92.43179309 33.64797178
  • 44. 44 37000 1 31 3 98.88151236 27.27911689 38000 1 50 3 96.13163571 17.72974834 39000 5 22 1 97.02544502 34.0623758 40000 3 12 2 96.15792541 30.01084341 41000 8 33 1 92.6072532 34.91882885 42000 6 30 2 95.37924219 38.42211453 43000 4 15 1 92.30750405 32.68717421 44000 7 28 3 94.49452536 40.36206059 45000 7 27 3 95.45092271 16.34874981 46000 5 27 1 96.64010756 40.30618937 47000 2 32 2 92.37634594 31.34647659 48000 2 36 1 91.23987267 21.16304196 49000 3 43 3 97.40824637 37.21525023 50000 1 12 1 97.17706732 18.53949767 51000 9 42 1 96.01816413 30.1555764 52000 2 38 3 92.48792584 17.66309901 53000 10 44 2 94.46830957 22.97369953 54000 7 30 2 98.10097504 40.13554461 55000 4 31 1 97.95035846 21.20222369 56000 2 40 3 97.13868828 35.6901613 57000 2 48 3 93.55654996 16.55046968 58000 4 20 3 92.38226205 31.27052777 59000 5 37 2 93.48526928 32.75813635 60000 1 33 3 98.84530446 32.4376096 61000 1 17 1 96.28232308 32.10352485 62000 3 34 2 96.97965357 33.72651938 63000 4 48 1 98.89024412 29.92691276 64000 3 25 3 92.82758822 21.51346239 65000 6 15 1 93.60358509 17.88959377 66000 5 11 1 94.80369651 36.65892868 67000 10 17 2 96.40250124 36.71025453 68000 1 27 3 91.35014933 24.14168778
  • 45. 45 69000 6 10 3 93.83818346 16.49742118 70000 8 31 2 95.1970614 20.41371837 71000 3 10 2 96.76017653 37.57928148 72000 6 26 2 96.36861206 17.56327725 73000 4 27 1 98.24578044 38.16389526 74000 7 24 1 93.77139385 18.22701564 75000 1 30 1 91.69639297 23.04492839 76000 4 29 2 96.51773395 24.76764888 77000 9 44 1 98.5190595 39.95662766 78000 7 46 1 94.7629563 17.26677531 79000 1 29 3 94.85044372 38.49637397 80000 2 29 1 98.62917285 40.85751541 81000 4 46 2 93.24627013 35.68743522 82000 5 11 1 99.71532399 28.52575177 83000 8 30 2 95.62676147 23.42587948 84000 3 14 1 92.66271567 24.37432495 85000 6 18 1 95.29526776 22.80211135 86000 8 14 2 92.46144523 20.48319044 87000 2 27 2 94.56656978 35.45920674 88000 2 42 2 97.89826853 39.36945694 89000 1 29 3 92.26531075 33.77601159 90000 4 35 2 94.83214801 38.84197587 91000 8 32 1 91.21519486 22.31646413 92000 3 40 1 91.71358899 39.89790079 93000 5 33 2 98.46424145 34.74886128 94000 7 34 3 94.08572974 23.59100755 95000 5 37 1 98.35090909 35.04958557 96000 1 35 1 97.92837342 26.35092419 97000 4 39 2 94.66189627 26.24445744 98000 8 20 3 98.20442344 31.96238143 99000 8 40 2 93.89936277 15.30406946 100000 4 15 2 97.61285491 31.05023913
5.2 Ensemble approach
In this approach, we first build the hybrid classifiers individually to attain good generalization performance (optimizing each model for performance on unseen data rather than on the training data). The test data is then passed through every individual model and the corresponding outputs are combined to choose the final output. Empirical results show that the proposed ensemble approach gives better performance for detecting probes and U2R attacks than all three individual models. The ensemble classifies most of them accurately by picking up all the classes that are correctly classified by the three classifiers. As anticipated, the ensemble exploits the differences in misclassification among the component models and improves the overall performance. Evidently, none of the classifiers considered so far may perform well for detecting all the attacks; to take advantage of the strengths of the different classifiers, a hierarchical hybrid intelligent system is proposed as described.
Fig 5.5: Scatter plot of host rate against count, per class, using KNN
Fig 5.6: Scatter plot of bytes transferred against bytes received, per class, using KNN
Fig 5.7: Classification results using the ensemble of classifiers
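A minimal sketch of the combination step described above, passing the test data through each component KNN model and merging their outputs. Majority voting (scikit-learn's VotingClassifier) is used here only as a simple stand-in for the GP-evolved combination function of the thesis, and the synthetic data and component settings are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical stand-in data for the PCA-reduced KDD features
X, y = make_classification(n_samples=2000, n_features=10, n_informative=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Component KNN classifiers with different neighbourhood sizes
components = [("knn11", KNeighborsClassifier(n_neighbors=11)),
              ("knn33", KNeighborsClassifier(n_neighbors=33)),
              ("knn49", KNeighborsClassifier(n_neighbors=49))]

# Majority voting as a simple stand-in for the GP-evolved combination function
ensemble = VotingClassifier(estimators=components, voting="hard")
ensemble.fit(X_train, y_train)
print("ensemble accuracy:", ensemble.score(X_test, y_test))
```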
In statistics, a receiver operating characteristic (ROC) curve is a graphical plot that illustrates the performance of a binary classifier system as its discrimination threshold is varied. The curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings. The true positive rate is also known as sensitivity, or as recall in machine learning; in signal detection theory and biomedical informatics it is related to the sensitivity index d' ("d-prime"). The figure below displays the ROC curve for the GP-based NN classifier.
Fig 5.8: ROC curve for the GP-based classifier showing 0.99976 area under the curve
The algorithm builds a KNN model from the training data set in such a way as to maximize the chances of detection on the training data and to reduce the errors for each class. Two models were developed on the basis of the KDD CUP data set: the first on the KDD CUP training data subset and the second on the KDD CUP testing data subset. The model developed using the KDD CUP training data subset was tested over the KDD CUP testing data subset, and vice versa.
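The ROC curve and the area under it (reported as 0.99976 in Fig 5.8) can be computed from class scores as in the following sketch; the ground-truth vector and classifier scores here are hypothetical placeholders, not the thesis results.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Hypothetical ground truth (1 = attack) and classifier scores (e.g. KNN class probabilities)
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0, 1, 1])
y_score = np.array([0.1, 0.3, 0.8, 0.65, 0.9, 0.2, 0.75, 0.4, 0.95, 0.6])

fpr, tpr, thresholds = roc_curve(y_true, y_score)   # TPR vs FPR at each threshold
print("AUC:", roc_auc_score(y_true, y_score))       # area under the ROC curve
```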
After creation of the K-NN models for U2R and R2L type attacks, optimized rules were extracted using the Nearest Neighbour rules utility.

Normal Packets                    Training              Testing
Data Subset                       KDD 10% corrected     KDD 10% corrected
Normal Packets                    99.99%                99.98%
Table 5.2: Comparison of GP Ensemble NN Performance for the Normal Category
Fig 5.9: Confusion matrix for normal records
Fig 5.10: Comparison of GP Ensemble NN Performance for the Normal Category

U2R Record Detection Rate         Training              Testing
Data Subset                       KDD 10% corrected     KDD 10% corrected
Not-U2R Record Detection Rate     99.95%                99.91%
U2R Record Detection Rate         96.92%                96.14%
Table 5.3: Comparison of GP Ensemble NN Performance for the U2R Attack Category
Fig 5.11: Confusion matrix for U2R type attacks
Fig 5.12: Comparison of GP Ensemble NN Performance for the U2R Attack Category
R2L Record Detection Rate         Training              Testing
Data Subset                       KDD 10% corrected     KDD 10% corrected
Not-R2L Record Detection Rate     98.12%                97.89%
R2L Record Detection Rate         98.34%                99.87%
Table 5.4: Comparison of GP Ensemble NN Performance for the R2L Attack Category
Fig 5.13: Confusion matrix for R2L type attacks
Fig 5.14: Comparison of GP Ensemble NN Performance for the R2L Attack Category

Probing Record Detection Rate       Training              Testing
Data Subset                         KDD 10% corrected     KDD 10% corrected
Not-probing Record Detection Rate   98.13%                98.44%
Probing Record Detection Rate       94.56%                93.59%
Table 5.5: Comparison of GP Ensemble NN Performance for the Probe Attack Category
Fig 5.15: Confusion matrix for probe type attacks
Fig 5.16: Comparison of GP Ensemble NN Performance for the Probe Attack Category
CHAPTER 6
CONCLUSION AND REFERENCES
6.1 CONCLUSION:
To address the various issues discussed, we proposed a system, i.e. an ensemble using genetic programming, which has better performance compared to others. In this work we ensemble only NN classifiers; future work will be carried out on ensembling heterogeneous classifiers to obtain still better results. This work has addressed many issues that make the design of effective classifiers difficult. It discussed GP in detail and how a classifier of better performance can be developed. In short, in an ensemble of classifiers using genetic programming, much of the human expert requirement is removed and an automatic system has been developed. As the results show, when the models were trained on four folds, the amount of information required to achieve a desirable detection rate was very high. Only an 80% detection rate was achieved by the algorithms tested in the literature for R2L and U2R attacks. If the number of records is increased to 99.98% of the training data subset in the testing data subset, the detection rate for R2L and U2R type attacks increases to 99%. Over the last decades, the pattern recognition approach to intrusion detection has attracted a lot of interest, and the demand for reliable and sophisticated intrusion detection systems capable of detecting polymorphous attacks has increased. In this work, we have presented a novel intrusion detection approach that uses a genetic programming based ensemble for detecting intrusions. The experimental results demonstrate that the GP-based ensemble classifier is effective in reducing false alarms, so that widespread IDS systems can be implemented using our approach considering both accuracy and interpretability. In future, feature selection can be used not only to alleviate the curse of dimensionality and minimize classification errors, but also to improve the interpretability of ensemble-based classifiers. Our future work will focus on reducing the features used by the classifiers through feature selection. The work will also continue with a study of the fitness function of the genetic algorithm to manipulate more parameters of the fuzzy inference module, even concentrating on the fuzzy rules themselves.
6.2 Future Scope:
Various papers have proposed assorted methodologies for ensembles of classifiers. The ensembles so produced were for specific application fields. No doubt these systems have many advantages, but they have limitations too: diversity, computation time, detection ratio and false alarms were the main issues.
If one system addressed one issue, the others were neglected; that was the major limitation of these systems. Generally these systems also did not work well on unbalanced data. To address all these issues we proposed a system, i.e. an ensemble using genetic programming, that has better performance compared to the others. In this work we ensemble only NN classifiers; future work will be carried out on ensembling heterogeneous classifiers to obtain better results. This work addresses many issues that make the design of effective classifiers difficult; it discussed GP in detail and how a classifier of better performance can be developed. In short, in an ensemble of classifiers using genetic programming, much of the human expert requirement is removed and an automatic system has been developed.
6.3 REFERENCES:
1. Gianluigi Folino, Giandomenico Spezzano and Clara Pizzuti, Ensemble Techniques for Parallel Genetic Programming Based Classifiers.
2. Durga Prasad Muni, Nikhil R. Pal and Jyotirmoy Das, A Novel Approach to Design Classifiers Using Genetic Programming, IEEE Transactions on Evolutionary Computation, Vol. 8, No. 2, April 2004.
3. Michał Woźniak, Manuel Graña and Emilio Corchado, 2014, A Survey of Multiple Classifier Systems as Hybrid Systems, Elsevier.
4. Mark Crosbie and Eugene H. Spafford, 1995, Applying Genetic Programming to Intrusion Detection, AAAI Technical Report FS-95-01, AAAI (www.aaai.org).
5. K. M. Faraoun, 2006, Genetic Programming Approach for Multi-Category Pattern Classification Applied to Network Intrusions Detection, International Journal of Computational Intelligence and Applications, Vol. 6, No. 1 (2006) 77–99, Imperial College Press.
6. Gianluigi Folino, Clara Pizzuti and Giandomenico Spezzano, GP Ensemble for Distributed Intrusion Detection Systems, ICAR-CNR, Via P. Bucci 41/C, Univ. della Calabria, 87036 Rende (CS), Italy.
7. Niusvel Acosta-Mendoza, Alicia Morales-Reyes, Hugo Jair Escalante and Andrés Gago-Alonso, 2014, Learning to Assemble Classifiers via Genetic Programming, International Journal of Pattern Recognition and Artificial Intelligence, Vol. 28, No. 7 (2014) 1460005 (19 pages), World Scientific Publishing Company.
8. Xihua Li, Fuqiang Wang and Xiaohong Chen, 2015, Support Vector Machine Ensemble Based on Choquet Integral for Financial Distress Prediction, International Journal of Pattern Recognition and Artificial Intelligence, Vol. 29, No. 4 (2015) 1550016 (24 pages), World Scientific Publishing Company.
9. Preeti Aggarwal and Sudhir Kumar Sharma, 2015, Analysis of KDD Dataset Attributes - Class wise for Intrusion Detection, 3rd International Conference on Recent Trends in Computing 2015 (ICRTC-2015).
10. Anup Goyal and Chetan Kumar, GA-NIDS: A Genetic Algorithm based Network Intrusion Detection System.
11. Urvesh Bhowan, Mark Johnston, Mengjie Zhang and Xin Yao, Evolving Diverse Ensembles Using Genetic Programming for Classification With Unbalanced Data, IEEE Transactions on Evolutionary Computation, Vol. 17, No. 3, June 2013.
12. Saurabh Mukherjee and Neelam Sharma, 2012, Intrusion Detection using Naive Bayes Classifier with Feature Reduction.
13. H. Nguyen, K. Franke and S. Petrovic, Improving Effectiveness of Intrusion Detection by Correlation Feature Selection, 2010 International Conference on Availability, Reliability and Security, IEEE.
14. Nivedita Naidu and R. V. Dharaskar, "An effective approach to network intrusion detection system using genetic algorithm", International Journal of Computer Applications (0975 – 8887), Volume 1, No. 2, 2010.
15. N. Chawla and J. Sylvester, "Exploiting diversity in ensembles: Improving the performance on unbalanced datasets," in Proc. 7th Int. Conf. MCS, 2007, pp. 397–406.
16. H. Abbass, "Pareto-optimal approaches to neuro-ensemble learning," in Multi-Objective Machine Learning (Studies in Computational Intelligence, vol. 16), Y. Jin, Ed. Berlin/Heidelberg, Germany: Springer, 2006, pp. 407–427.
17. U. Bhowan, M. Zhang and M. Johnston, "Genetic programming for classification with unbalanced data," in Proc. 13th Eur. Conf. Genetic Programming, LNCS 6021, 2010, pp. 1–13.
18. M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann and I. H. Witten, "The WEKA data mining software: An update," SIGKDD Explorations, vol. 11, no. 1, pp. 10–18, Nov. 2009.
19. Z. Wang, K. Tang and X. Yao, "Multiobjective approaches to optimal testing resource allocation in modular software systems," IEEE Trans. Reliab., vol. 59, no. 3, pp. 563–575, Sep. 2010.
20. C. Coello Coello, G. Lamont and D. Veldhuizen, Evolutionary Algorithms for Solving Multi-Objective Problems (Genetic and Evolutionary Computation Series). Berlin, Germany: Springer, 2007.
21. I. Ahmad, A. B. Abdulah, A. S. Alghamdi, K. Alnfajan and M. Hussain, Feature Subset Selection for Network Intrusion Detection Mechanism Using Genetic Eigen Vectors, Proc. of CSIT, vol. 5, 2011.
22. Saman M. Abdulla, Najla B. Al-Dabagh and Omar Zakaria, Identify Features and Parameters to Devise an Accurate Intrusion Detection System Using Artificial Neural Network, World Academy of Science, Engineering and Technology, 2010.
23. NSL-KDD dataset for network-based intrusion detection systems, available at http://iscx.info/NSL-KDD/.
24. http://www.cs.waikato.ac.nz/~ml/weka/.
25. Gianluigi Folino, Clara Pizzuti and Giandomenico Spezzano, 2010, An ensemble-based evolutionary framework for coping with distributed intrusion detection, Genetic Programming and Evolvable Machines 11, 131–146.
26. Shelly Xiaonan Wu and Wolfgang Banzhaf, 2010, The use of computational intelligence in intrusion detection systems: A review, Applied Soft Computing 10, 1–35.
27. Ahmad Taher Azar, Hanaa Ismail Elshazly, Aboul Ella Hassanien and Abeer Mohamed Elkorany, 2013, A random forest classifier for lymph diseases, Computer Methods and Programs in Biomedicine.
28. P. Ravisankar, V. Ravi, G. Raghava Rao and I. Bose, 2011, Detection of financial statement fraud and feature selection using data mining techniques, Decision Support Systems 50, 491–500.