SlideShare ist ein Scribd-Unternehmen logo
1 von 5
Downloaden Sie, um offline zu lesen
Machine Learning Based Botnet Detection

        Vaibhav Nivargi            Mayukh Bhaowal                                      Teddy Lee
                     {vnivargi, mayukhb, tlee21}@cs.stanford.edu
                            CS 229 Final Project Report

               I. INTRODUCTION                          these concerted efforts, Botnets remain an unsolved
                                                        problem for the online community.
A Botnet [1] is a large collection of compromised
machines, referred to as zombies [2], under a
common Command-and-Control infrastructure
(C&C), typically used for nefarious purposes.
Botnets are used in a variety of online crimes
including, and not limited to, large scale DDoS
attacks, Spam, Click Fraud, Extortion, and Identity
theft. The scale and geographical diversity of the
machines enlisted in a Botnet, coupled with easily
available source code, and support from
communities, as well as mercenary Botmasters
providing Botnet services for rent, have resulted in
Botnets becoming a highly sophisticated and
effective tool for committing online crime in recent
times [3][4]. Botnets with thousands and millions
of nodes have been observed in the wild, with
newer ones being observe every day [10].
          The lifecycle of a Botnet is depicted in
Fig.1. The initial enlisting occurs by exploiting a
known vulnerability in the Operating systems
running on the machines using a wide variety of                        Fig 1. Botnet in action
mechanisms like worms, Trojans, P2P file sharing
networks, and exploits of common Windows                                    II. DATA
vulnerabilities, etc. Once compromised, the bot is
programmed to connect to a central location             We had two separate data sets to collect for the
(typically an IRC [11] server), where the Botmaster     purpose of our experiments. The first set included a
could login and issue commands to the logged in         large number of binaries/executables labeled as
bots. This mechanism essentially means that the         botnet binaries and benign binaries. We acquired
communication is free, as broadcast is taken care of    the botnet binaries from the computer science
by the IRC channel. Most Bots additionally ship         department at John Hopkins University. This is the
with multiple IRC and DNS addresses, meaning            same data they used for botnet related work [18].
taking down one such IRC channel does not in            As far as the benign binaries are concerned we
general impair the activities of the Botnet.            randomly picked them up from Unix and windows
          Originally, most techniques to thwart         machines, namely from /usr/bin and from windows
Botnets have been reactive, reducing their              system32 directory. This data was comprehensive
effectiveness significantly, e.g. using Honeypots to    and well represented.
trap and study malware, etc. Of late, significant                We also needed labeled IRC logs for our
research has been made into the dynamics of a           experiments. While benign IRC logs are easily
Botnet, and more proactive techniques have been         available on the web, botnet affected IRC logs are
suggested [5] [6] [7] [8]. A wide variety of features   not readily available because of privacy and legal
and watermarks for network activity are employed        issues. The benign IRC logs were collected from
to predict Botnet activity, including TCP syn           several publicly available IRC logs from IRC
scanning, DNS monitoring, and extensive models          channels      like    Wikipedia,     linode,     and
of Botnet attack and propagation [9]. Despite all       kernelnewbies. These represent a diverse collection
                                                        of logs, with different purposes. Some of these logs
also have automated bots for channel maintenance        We used supervised learning techniques on groups
operations.                                             of benign executables vs. Botnet executables.
         Obtaining malicious IRC logs proved to be
extremely hard. There are several independent           3.1.1 Features
Security forums who actively track botnets [12].
They monitor and mine Botnet IRC logs for               We are focusing on binary profiling, and hex
analysis and mitigation. Due to privacy and security    dumps for feature extraction. Identifying the strain
issues, we were unable to obtain this data, which       of these binaries might also give an insight about
clearly represents a rich set of real world Botnet      the location of the Command-and-Control center
IRC logs, and which would have definitely               and about the botnet capabilities in general. We
provided more qualitative as well as quantitative       used n-grams (more specifically 4-grams) of
results. Several other potential sources setup their    hexdump of the binaries as our features. For
own private infrastructure for collecting such          example if the hexdump is ff 00 12 1a 32, our
training data [5].                                      features will be ff 00 12 1a and 0012 1a 32. We
         Nevertheless we acquired data from             extracted around more than a million features. We
Northwestern University where the department of         then used chi-square to select around 10,000 most
CS is conducting research on wireless overlay           informative features. For each feature, we find its
network architectures and botnet detection [7]. The     value as follows:
data regarding botnet IRC logs was not
comprehensive in the sense that it was IRC traffic                       feature=00121a32    feature≠00121a32
over a small amount of time. A larger and more          class=botnet     A                   C
comprehensive dataset could have established our        class≠botnet     B                   D
results and hypothesis more conclusively.
                                                                                         N ( AD − CB) 2
               III. APPROACHES                           χ 2 (botnet, f ) =
                                                                               ( A + C )(B + D)( A + B)(C + D)
         .
We tried a 2 stage approach to solve this issue.        Now we select the top 10,000 features with the
These methods are complementary and we can              highest chi-square scores.
combine them for better results. They are as
follows:                                                3.1.2 Classification
3.1 Botnet Binary Detection                             We used several classification algorithms to
                                                        classify the binaries into malicious/benign. The
There are several stages in the lifecycle of a Botnet   models we used are as follows:
where a Machine learning based solution can be          1. Multinomial Naïve Bayes
deployed to thwart its effectiveness. During an         2. linear SVM
early stage, a Binary detection and classification      3. kNN
system can warn of a potentially malicious                   This makes use of similarity metrics (e.g.
executable which might endanger a host machine.              cosine similarity) to find k nearest neighbors
There has been some work already in this area [7]            and classifies the point under consideration
and we leverage on top of their work to classify             into the majority class of its k nearest neighbor
Botnet executables which propagate as worms on               [15]. We used k=5.
networks scanning for vulnerable machines for           4. Logistic Regression
infecting them, and enlisting them into the Bot         5. Multiboost Adaboost - Class for boosting a
network.                                                     classifier using the               MultiBoosting
         Unlike Virus scanners like Norton AV, or            method. MultiBoosting is an extension to the
McAfee, a Machine learning solution can perform              highly successful AdaBoost technique for
this classification without the need for explicit            forming decision committees. MultiBoosting
signatures. Identifying such binaries without                can be viewed as combining AdaBoost with
explicit signatures needs recognizing common                 wagging. It is able to harness both AdaBoost's
features and correlations between these binaries,            high bias and variance reduction with
e.g. a Botnet executable will be a self-propagating          wagging's superior variance reduction. Using
worm with a simple IRC client built in. Presence or          C4.5 as the base learning algorithm, Multi-
absence of such a feature is an indicator that such a        boosting is demonstrated to produce decision
binary might potentially be a Botnet executable.             committees with lower error than either
AdaBoost or wagging significantly more often                   attributes not yet considered in the path from
     than the reverse over a large representative                   the root.
     cross-section of UCI data sets. It offers the
     further advantage over AdaBoost of suiting                For the purpose of our experiments we did not
     parallel execution. See [17] for more details.            make use of any separate testing data set. To stay
                                                               unbiased we used 10 fold cross validation to get the
6.   J48 Decision tree -This is an entropy based               results. We used off the shelf softwares such as
     approach for generating a pruned or unpruned              weak[19] and libsvm[13] for experimental results.
     C4.5 decision tree. For more information see
     [16]. In general, if we are given a probability           3.2 IRC log based detection
     distribution P = (p1, p2, .., pn) then the
     Information conveyed by this distribution, also                          IRC has played a central role in the
     called the Entropy of P, is:                                   simplicity and effectiveness of a Botnet. Using a
                                                                    public communication channel, the Botmaster can
I ( P) = −( p1 * log( p1 ) + p2 * log( p2 ) + ... + pn * log( pn )) use a simple command interface to communicate
                                                                    with a huge number of compromised zombie nodes,
                                                                    instructing them to carry out his orders.
       If a set T of records is partitioned into disjoint                      There are two phases to this approach:
      exhaustive classes C1, C2, .., Ck on the basis                First, to separate IRC traffic from other traffic. This
      of the value of the categorical attribute, then is a reasonably solved problem [5][6]. The second
      the information needed to identify the class of step comprises of identifying Botnet traffic in the
      an element of T is Info(T) = I(P), where P is IRC traffic. Hence the problem now boils down to
      the probability distribution of the partition (C1, a text classification problem.
      C2, .., Ck):                                                             To be able to differentiate a benign IRC
                                                                    log from an IRC log manifested with Botnet
                                                                    activity, we used features involving both dialogues
                                                                    and IRC commands. Then using these features, we
                      | C1 | | C 2 |          | Ck |                experimented with a variety of machine learning
               P=(           ,          ,...,        )              algorithms on them. In particular, we ran
                       |T | |T |               |T |
                                                                    algorithms such as Naïve Bayes, SVM, J48
                                                                    decision trees, kNN, etc. with 10 fold cross
      If we first partition T on the basis of the value validation. The main categories of features we used
      of a non-categorical attribute X into sets T1, T2, included:
      .., Tn then the information needed to identify
      the class of an element of T becomes the                           Number of users: An IRC channel with Botnet
      weighted average of the information needed to                      activity should contain an unusually large
      identify the class of an element of Ti, i.e. the                   number of users.
      weighted average of Info(Ti):                                      Mean / variance of words per line and
                                                                         characters per word in dialogues: Bots usually
                                n
                                   | Ti |                                do not produce dialogues that resemble human
           Info( X , T ) =∑   i =1 | T |
                                          * Info(Ti )                    dialogue.
                                                                         Number and frequency of IRC commands: We
                                                                         have noticed through examination of the logs
       Consider the quantity Gain(X,T) defined as                        that there tends to be a large number of IRC
                                                                         commands at small intervals in Botnet
        Gain( X , T ) = Info(T ) − Info( X , T )                         manifested logs. One possible explanation
                                                                         would be the immense number of joins and
       This represents the difference between the                        exits from the result of accommodating a huge
      information needed to identify an element of T                     number of users in one channel.
      and the information needed to identify an                          Number of lines, dialogues, and commands: In
      element of T after the value of attribute X has                    a benign IRC log, the number of dialogues
      been obtained, that is, this is the gain in                        should be much greater than the number of
      information due to attribute X. We can use this                    commands. And as mentioned above, a Botnet
      notion of gain to rank attributes and to build                     manifested log tends to contain an immense
      decision trees where at each node is located the                   number of IRC commands.
     attribute   with    greatest    gain   among      the
IV. RESULTS AND EVALUATION                       Model              Accuracy      F1       Kappa
                                                                                        score    score
4.1 Botnet Binary Detection
                                                       NB                 .70           .457     .705
The results obtained from the botnet binary based
detection approach are summarized in Fig. 2.           Linear SVM         .976          .945     .97
Clearly all the models performed reasonably well.
Special mention must be made about Naïve Bayes
which performed remarkably well although it is         J48 – decision .992              .982     .99
one of the simplest of models. SVM performed           tree
good too. However some models like kNN gave
an accuracy of 96.4 which was lower than that of       kNN                .992          .982     .991
others. We will discuss about these results in
details in the discussion section. In particular our
evaluation metric included accuracy, F1 score and      Logistic           .989          .974     .987
kappa measure.
                                                       MultiboostAB       .967          .926     .963

              # Correctly predicted data po int s
Accuracy =                                                   Fig. 3 Performance of IRC based detection
                     # Total data po int s


                                                       4.2 IRC log based detection
     2 * Pr ecision * Re call
F1 =
      Pr ecision + Re call                                      The results obtained from the IRC log
                                                       based detection approach are summarized in Fig. 3.
                                                       A large chunk of the problem here involves text
           Observed Agreement − Chance Agreement       classification, and all algorithms, barring Naïve
Kappa =                                                Bayes, perform encouragingly well.
             Total Observed − Chance Agreement
                                                                        V. DISCUSSION

                                                       In the binary based classification approach, most of
                                                       the linear classifiers such as NB and linear SVM
                                                       performed remarkably well. kNN however
Model              Accuracy      F1       Kappa        performed relatively bad. This can be explained on
                                 score    score        the basis of linearity of the data set. kNN is a non-
                   .982          .981     .963         linear model and hence it has less bias and a high
NB
                                                       variance unlike NB which is linear and has high
                                                       bias and less variance. Given our data set was
Linear SVM         .982          .981     .963         linear, it was therefore no surprise that kNN
                                                       exhibited a higher generalization error compared to
J48 – decision .976              .975     .95          NB or linear SVM.
tree                                                             In the IRC log based approach, most
                                                        classifiers performed similarly, except for Naïve
                                                        Bayes, which was significantly worse. One
kNN                .964          .963     .926          possible explanation would be that there are
                                                        dependencies in the features. These dependencies
Logistic           .982          .9815    .963          may be resolved with more training data. Since the
                                                        set of IRC logs we obtained was not very large and
                                                        comprehensive, it is possible that currently
MultiboostAB       .976          .975     .951          correlated features are not actually correlated in a
                                                        larger dataset.
     Fig. 2 Performance of binary based detection                The other classifiers performed very well,
                                                        all having accuracies greater than .96. Since SVM
and logistic regression had the best accuracies, F1      JHU for botnet binary datasets. We further
scores, and Kappa scores, it is likely that our          acknowledge our gratitude to Prof. Andrew Ng and
dataset is linear.                                       the TAs of cs229 for their help and support.
         In the logs that we collected with Botnet
activity, there weren’t strong attempts of log                          VIII. REFERENCES
obfuscation by Botmasters. But suppose that there
were strong attempts to produce human-like                [1] http://en.wikipedia.org/wiki/Botnet
dialogues by the bots. Then to potentially fool our       [2] http://en.wikipedia.org/wiki/Zombie_computer
classifiers, the dialogues must produce similar           [3]http://securitywatch.eweek.com/exploits_and_at
averages and variances in terms of words per line         tacks/everydns_opendns_under_botnet_ddos_attac
and characters per word.                                  k.html
         Pushing the above scenario to the extreme,       [4]http://it.slashdot.org/article.pl?sid=06/10/17/002
suppose that the bots produce perfect dialogues. In       251
this case, our classifier should still work due to one    [5] W. Timothy Strayer, Robert Walsh, Carl
fundamental difference between benign logs and            Livadas, and David Lapsley. Detecting Botnets
logs manifested with Botnet activity – there is an        with Tight Command and Control. To Appear in
immense number of join and exit commands in a             31st IEEE Conference on Local Computer
Botnet infested channel.                                  Networks (LCN'06). November 2006
                                                         [6] Using Machine Learning Techniques to Identify
                                                         Botnet Traffic, Carl Livadas, Robert Walsh, David
               VI. CONCLUSION                            Lapsley, W. Timothy Strayer
                                                         [7] Yan Chen, “Towards wireless overlay network
         In this course project we aimed to              architectures”
highlight a major problem plaguing the Internet          [8] Binkley- An Algorithm for Anomaly-based
today and a potential solution to the problem by         Botnet Detection .
employing novel Machine Learning techniques.             [9] E. Cooke, F. Jahanian, and D. McPherson. The
Our results are encouraging, indicating a definite       zombie roundup: Understanding, detecting, and
advantage of using such techniques in the wild for       disrupting botnets. In USENIX SRUTI Workshop,
alleviating the problem of Botnets.                      pages 39–44, 2005.
         Conversion of the problem in the Text           [10]http://www.vnunet.com/vnunet/news/2144375/
classification domain has opened up new avenues          botnet-operation-ruled-million
in feature extraction for our solution. Possible         [11] http://www.irc.org/tech_docs/rfc1459.html
future enhancements include adding additional            [12] www.shadowserver.org
features to the IRC traffic classification engine.       [13] http://www.csie.ntu.edu.tw/~cjlin/libsvm/
Also to solve the problem of training data which,        [14] Detecting Botnets with Tight Command and
in retrospect, seems to be the critical to the success   Control, Carl Livadas, Robert Walsh, David
of such a project, we propose setting up a farm of       Lapsley, W. Timothy Strayer.
Virtual machines as a testbed for collecting and           [15] Aha, D., and D. Kibler (1991) "Instance-
mining our own logs for training purpose.                 based learning algorithms", Machine Learning,
         Lastly, the end goal of such a tool should       vol.6, pp. 37-66.
be to provide proactive help to system                    [16] Ross Quinlan (1993). "C4.5: Programs for
administrators. In this regard, it is easy to envisage    Machine Learning", Morgan Kaufmann Publishers,
an application which can run on a Gateway or a            San Mateo, CA.
Router, and look at Network flow traffic, and             [17] Geoffrey I. Webb (2000). "MultiBoosting: A
classify it on-the-fly enabling automated                 Technique for Combining Boosting and Wagging".
blacklisting of involved machines in the network.         Machine Learning, 40(2): 159-196, Kluwer
                                                          Academic Publishers, Boston
                                                          [18] Moheeb Abu Rajab et.al (2006)”A
                                                          Multifaceted approach to understanding the botnet
         VII. ACKNOWLEDGEMENT                             phenonmenon”
                                                          [19] http://www.cs.waikato.ac.nz/ml/weka/
We take this opportunity to thank Elizabeth Stinson
of Stanford Security Group for her help and
support. We also extend our acknowledgements to
Yan Chen of northwestern university for providing
us with botnet IRC logs and to Andreas Terzis of

Weitere ähnliche Inhalte

Was ist angesagt?

Anti forensic
Anti forensicAnti forensic
Anti forensicMilap Oza
 
Password Cracking
Password CrackingPassword Cracking
Password CrackingSagar Verma
 
A2 - broken authentication and session management(OWASP thailand chapter Apri...
A2 - broken authentication and session management(OWASP thailand chapter Apri...A2 - broken authentication and session management(OWASP thailand chapter Apri...
A2 - broken authentication and session management(OWASP thailand chapter Apri...Noppadol Songsakaew
 
Vulnerability Management
Vulnerability ManagementVulnerability Management
Vulnerability Managementasherad
 
Malware detection-using-machine-learning
Malware detection-using-machine-learningMalware detection-using-machine-learning
Malware detection-using-machine-learningSecurity Bootcamp
 
DNS spoofing/poisoning Attack
DNS spoofing/poisoning AttackDNS spoofing/poisoning Attack
DNS spoofing/poisoning AttackFatima Qayyum
 
Seminar Presentation | Network Intrusion Detection using Supervised Machine L...
Seminar Presentation | Network Intrusion Detection using Supervised Machine L...Seminar Presentation | Network Intrusion Detection using Supervised Machine L...
Seminar Presentation | Network Intrusion Detection using Supervised Machine L...Jowin John Chemban
 
Ethical hacking Chapter 7 - Enumeration - Eric Vanderburg
Ethical hacking   Chapter 7 - Enumeration - Eric VanderburgEthical hacking   Chapter 7 - Enumeration - Eric Vanderburg
Ethical hacking Chapter 7 - Enumeration - Eric VanderburgEric Vanderburg
 
Vulnerability assessment and penetration testing
Vulnerability assessment and penetration testingVulnerability assessment and penetration testing
Vulnerability assessment and penetration testingAbu Sadat Mohammed Yasin
 
Data mining presentation.ppt
Data mining presentation.pptData mining presentation.ppt
Data mining presentation.pptneelamoberoi1030
 
DDoS Attack Detection and Botnet Prevention using Machine Learning
DDoS Attack Detection and Botnet Prevention using Machine LearningDDoS Attack Detection and Botnet Prevention using Machine Learning
DDoS Attack Detection and Botnet Prevention using Machine LearningIRJET Journal
 
Malicious software and software security
Malicious software and software  securityMalicious software and software  security
Malicious software and software securityG Prachi
 

Was ist angesagt? (20)

Malware Detection using Machine Learning
Malware Detection using Machine Learning	Malware Detection using Machine Learning
Malware Detection using Machine Learning
 
Anti forensic
Anti forensicAnti forensic
Anti forensic
 
Password Cracking
Password CrackingPassword Cracking
Password Cracking
 
A2 - broken authentication and session management(OWASP thailand chapter Apri...
A2 - broken authentication and session management(OWASP thailand chapter Apri...A2 - broken authentication and session management(OWASP thailand chapter Apri...
A2 - broken authentication and session management(OWASP thailand chapter Apri...
 
Vulnerability Management
Vulnerability ManagementVulnerability Management
Vulnerability Management
 
Malware detection-using-machine-learning
Malware detection-using-machine-learningMalware detection-using-machine-learning
Malware detection-using-machine-learning
 
DNS spoofing/poisoning Attack
DNS spoofing/poisoning AttackDNS spoofing/poisoning Attack
DNS spoofing/poisoning Attack
 
Bug bounty
Bug bountyBug bounty
Bug bounty
 
Seminar Presentation | Network Intrusion Detection using Supervised Machine L...
Seminar Presentation | Network Intrusion Detection using Supervised Machine L...Seminar Presentation | Network Intrusion Detection using Supervised Machine L...
Seminar Presentation | Network Intrusion Detection using Supervised Machine L...
 
Apriori Algorithm
Apriori AlgorithmApriori Algorithm
Apriori Algorithm
 
Ethical hacking Chapter 7 - Enumeration - Eric Vanderburg
Ethical hacking   Chapter 7 - Enumeration - Eric VanderburgEthical hacking   Chapter 7 - Enumeration - Eric Vanderburg
Ethical hacking Chapter 7 - Enumeration - Eric Vanderburg
 
Web mining
Web mining Web mining
Web mining
 
Linux forensics
Linux forensicsLinux forensics
Linux forensics
 
Malware analysis
Malware analysisMalware analysis
Malware analysis
 
Vulnerability assessment and penetration testing
Vulnerability assessment and penetration testingVulnerability assessment and penetration testing
Vulnerability assessment and penetration testing
 
Data mining presentation.ppt
Data mining presentation.pptData mining presentation.ppt
Data mining presentation.ppt
 
DDoS Attack Detection and Botnet Prevention using Machine Learning
DDoS Attack Detection and Botnet Prevention using Machine LearningDDoS Attack Detection and Botnet Prevention using Machine Learning
DDoS Attack Detection and Botnet Prevention using Machine Learning
 
Metasploit
MetasploitMetasploit
Metasploit
 
Malicious software and software security
Malicious software and software  securityMalicious software and software  security
Malicious software and software security
 
Module 02 ftk imager
Module 02 ftk imagerModule 02 ftk imager
Module 02 ftk imager
 

Ähnlich wie Machine Learning Based Botnet Detection

Botnet detection by Imitation method
Botnet detection  by Imitation methodBotnet detection  by Imitation method
Botnet detection by Imitation methodAcad
 
lab3cdga.ziplab3code.c#include stdio.h#include std.docx
lab3cdga.ziplab3code.c#include stdio.h#include std.docxlab3cdga.ziplab3code.c#include stdio.h#include std.docx
lab3cdga.ziplab3code.c#include stdio.h#include std.docxsmile790243
 
Lab3code.c#include stdio.h#include stdlib.h#include.docx
Lab3code.c#include stdio.h#include stdlib.h#include.docxLab3code.c#include stdio.h#include stdlib.h#include.docx
Lab3code.c#include stdio.h#include stdlib.h#include.docxsmile790243
 
An Efficient Framework for Detection & Classification of IoT BotNet.pptx
An Efficient Framework for Detection & Classification of IoT BotNet.pptxAn Efficient Framework for Detection & Classification of IoT BotNet.pptx
An Efficient Framework for Detection & Classification of IoT BotNet.pptxSandeep Maurya
 
Genetic Algorithm based Layered Detection and Defense of HTTP Botnet
Genetic Algorithm based Layered Detection and Defense of HTTP BotnetGenetic Algorithm based Layered Detection and Defense of HTTP Botnet
Genetic Algorithm based Layered Detection and Defense of HTTP BotnetIDES Editor
 
Detecting Victim Systems In Client Networks Using Coarse Grained Botnet Algor...
Detecting Victim Systems In Client Networks Using Coarse Grained Botnet Algor...Detecting Victim Systems In Client Networks Using Coarse Grained Botnet Algor...
Detecting Victim Systems In Client Networks Using Coarse Grained Botnet Algor...IRJET Journal
 
Tracing Back The Botmaster
Tracing Back The BotmasterTracing Back The Botmaster
Tracing Back The BotmasterIJERA Editor
 
Significance
SignificanceSignificance
SignificanceJulie May
 
A Survey of Botnet Detection Techniques
A Survey of Botnet Detection TechniquesA Survey of Botnet Detection Techniques
A Survey of Botnet Detection Techniquesijsrd.com
 
Synopsis viva presentation
Synopsis viva presentationSynopsis viva presentation
Synopsis viva presentationkirubavenkat
 
IRJET - Network Traffic Monitoring and Botnet Detection using K-ANN Algorithm
IRJET - Network Traffic Monitoring and Botnet Detection using K-ANN AlgorithmIRJET - Network Traffic Monitoring and Botnet Detection using K-ANN Algorithm
IRJET - Network Traffic Monitoring and Botnet Detection using K-ANN AlgorithmIRJET Journal
 
Lightweight C&C based botnet detection using Aho-Corasick NFA
Lightweight C&C based botnet detection using Aho-Corasick NFALightweight C&C based botnet detection using Aho-Corasick NFA
Lightweight C&C based botnet detection using Aho-Corasick NFAIJNSA Journal
 
Towards botnet detection through features using network traffic classification
Towards botnet detection through features using network traffic classificationTowards botnet detection through features using network traffic classification
Towards botnet detection through features using network traffic classificationIJERA Editor
 
Cloud Camp Milan 2K9 Telecom Italia: Where P2P?
Cloud Camp Milan 2K9 Telecom Italia: Where P2P?Cloud Camp Milan 2K9 Telecom Italia: Where P2P?
Cloud Camp Milan 2K9 Telecom Italia: Where P2P?Gabriele Bozzi
 
CloudCamp Milan 2009: Telecom Italia
CloudCamp Milan 2009: Telecom ItaliaCloudCamp Milan 2009: Telecom Italia
CloudCamp Milan 2009: Telecom ItaliaGabriele Bozzi
 
Bot net detection by using ssl encryption
Bot net detection by using ssl encryptionBot net detection by using ssl encryption
Bot net detection by using ssl encryptionAcad
 
A Dynamic Botnet Detection Model based on Behavior Analysis
A Dynamic Botnet Detection Model based on Behavior AnalysisA Dynamic Botnet Detection Model based on Behavior Analysis
A Dynamic Botnet Detection Model based on Behavior Analysisidescitation
 
A taxonomy of botnet detection approaches
A taxonomy of botnet detection approachesA taxonomy of botnet detection approaches
A taxonomy of botnet detection approachesFabrizio Farinacci
 

Ähnlich wie Machine Learning Based Botnet Detection (20)

Paper(edited)
Paper(edited)Paper(edited)
Paper(edited)
 
Botnet detection by Imitation method
Botnet detection  by Imitation methodBotnet detection  by Imitation method
Botnet detection by Imitation method
 
lab3cdga.ziplab3code.c#include stdio.h#include std.docx
lab3cdga.ziplab3code.c#include stdio.h#include std.docxlab3cdga.ziplab3code.c#include stdio.h#include std.docx
lab3cdga.ziplab3code.c#include stdio.h#include std.docx
 
Lab3code.c#include stdio.h#include stdlib.h#include.docx
Lab3code.c#include stdio.h#include stdlib.h#include.docxLab3code.c#include stdio.h#include stdlib.h#include.docx
Lab3code.c#include stdio.h#include stdlib.h#include.docx
 
An Efficient Framework for Detection & Classification of IoT BotNet.pptx
An Efficient Framework for Detection & Classification of IoT BotNet.pptxAn Efficient Framework for Detection & Classification of IoT BotNet.pptx
An Efficient Framework for Detection & Classification of IoT BotNet.pptx
 
Genetic Algorithm based Layered Detection and Defense of HTTP Botnet
Genetic Algorithm based Layered Detection and Defense of HTTP BotnetGenetic Algorithm based Layered Detection and Defense of HTTP Botnet
Genetic Algorithm based Layered Detection and Defense of HTTP Botnet
 
Detecting Victim Systems In Client Networks Using Coarse Grained Botnet Algor...
Detecting Victim Systems In Client Networks Using Coarse Grained Botnet Algor...Detecting Victim Systems In Client Networks Using Coarse Grained Botnet Algor...
Detecting Victim Systems In Client Networks Using Coarse Grained Botnet Algor...
 
Tracing Back The Botmaster
Tracing Back The BotmasterTracing Back The Botmaster
Tracing Back The Botmaster
 
Significance
SignificanceSignificance
Significance
 
A Survey of Botnet Detection Techniques
A Survey of Botnet Detection TechniquesA Survey of Botnet Detection Techniques
A Survey of Botnet Detection Techniques
 
Synopsis viva presentation
Synopsis viva presentationSynopsis viva presentation
Synopsis viva presentation
 
IRJET - Network Traffic Monitoring and Botnet Detection using K-ANN Algorithm
IRJET - Network Traffic Monitoring and Botnet Detection using K-ANN AlgorithmIRJET - Network Traffic Monitoring and Botnet Detection using K-ANN Algorithm
IRJET - Network Traffic Monitoring and Botnet Detection using K-ANN Algorithm
 
Lightweight C&C based botnet detection using Aho-Corasick NFA
Lightweight C&C based botnet detection using Aho-Corasick NFALightweight C&C based botnet detection using Aho-Corasick NFA
Lightweight C&C based botnet detection using Aho-Corasick NFA
 
Towards botnet detection through features using network traffic classification
Towards botnet detection through features using network traffic classificationTowards botnet detection through features using network traffic classification
Towards botnet detection through features using network traffic classification
 
Cloud Camp Milan 2K9 Telecom Italia: Where P2P?
Cloud Camp Milan 2K9 Telecom Italia: Where P2P?Cloud Camp Milan 2K9 Telecom Italia: Where P2P?
Cloud Camp Milan 2K9 Telecom Italia: Where P2P?
 
CloudCamp Milan 2009: Telecom Italia
CloudCamp Milan 2009: Telecom ItaliaCloudCamp Milan 2009: Telecom Italia
CloudCamp Milan 2009: Telecom Italia
 
NetBrain CE 5.0
NetBrain CE 5.0NetBrain CE 5.0
NetBrain CE 5.0
 
Bot net detection by using ssl encryption
Bot net detection by using ssl encryptionBot net detection by using ssl encryption
Bot net detection by using ssl encryption
 
A Dynamic Botnet Detection Model based on Behavior Analysis
A Dynamic Botnet Detection Model based on Behavior AnalysisA Dynamic Botnet Detection Model based on Behavior Analysis
A Dynamic Botnet Detection Model based on Behavior Analysis
 
A taxonomy of botnet detection approaches
A taxonomy of botnet detection approachesA taxonomy of botnet detection approaches
A taxonomy of botnet detection approaches
 

Mehr von butest

EL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBEEL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBEbutest
 
1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同butest
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALbutest
 
Timeline: The Life of Michael Jackson
Timeline: The Life of Michael JacksonTimeline: The Life of Michael Jackson
Timeline: The Life of Michael Jacksonbutest
 
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...butest
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALbutest
 
Com 380, Summer II
Com 380, Summer IICom 380, Summer II
Com 380, Summer IIbutest
 
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet JazzThe MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazzbutest
 
MICHAEL JACKSON.doc
MICHAEL JACKSON.docMICHAEL JACKSON.doc
MICHAEL JACKSON.docbutest
 
Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1butest
 
Facebook
Facebook Facebook
Facebook butest
 
Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...butest
 
Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...butest
 
NEWS ANNOUNCEMENT
NEWS ANNOUNCEMENTNEWS ANNOUNCEMENT
NEWS ANNOUNCEMENTbutest
 
C-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.docC-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.docbutest
 
MAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.docMAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.docbutest
 
Mac OS X Guide.doc
Mac OS X Guide.docMac OS X Guide.doc
Mac OS X Guide.docbutest
 
WEB DESIGN!
WEB DESIGN!WEB DESIGN!
WEB DESIGN!butest
 

Mehr von butest (20)

EL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBEEL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBE
 
1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIAL
 
Timeline: The Life of Michael Jackson
Timeline: The Life of Michael JacksonTimeline: The Life of Michael Jackson
Timeline: The Life of Michael Jackson
 
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIAL
 
Com 380, Summer II
Com 380, Summer IICom 380, Summer II
Com 380, Summer II
 
PPT
PPTPPT
PPT
 
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet JazzThe MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
 
MICHAEL JACKSON.doc
MICHAEL JACKSON.docMICHAEL JACKSON.doc
MICHAEL JACKSON.doc
 
Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1
 
Facebook
Facebook Facebook
Facebook
 
Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...
 
Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...
 
NEWS ANNOUNCEMENT
NEWS ANNOUNCEMENTNEWS ANNOUNCEMENT
NEWS ANNOUNCEMENT
 
C-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.docC-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.doc
 
MAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.docMAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.doc
 
Mac OS X Guide.doc
Mac OS X Guide.docMac OS X Guide.doc
Mac OS X Guide.doc
 
hier
hierhier
hier
 
WEB DESIGN!
WEB DESIGN!WEB DESIGN!
WEB DESIGN!
 

Machine Learning Based Botnet Detection

  • 1. Machine Learning Based Botnet Detection Vaibhav Nivargi Mayukh Bhaowal Teddy Lee {vnivargi, mayukhb, tlee21}@cs.stanford.edu CS 229 Final Project Report I. INTRODUCTION these concerted efforts, Botnets remain an unsolved problem for the online community. A Botnet [1] is a large collection of compromised machines, referred to as zombies [2], under a common Command-and-Control infrastructure (C&C), typically used for nefarious purposes. Botnets are used in a variety of online crimes including, and not limited to, large scale DDoS attacks, Spam, Click Fraud, Extortion, and Identity theft. The scale and geographical diversity of the machines enlisted in a Botnet, coupled with easily available source code, and support from communities, as well as mercenary Botmasters providing Botnet services for rent, have resulted in Botnets becoming a highly sophisticated and effective tool for committing online crime in recent times [3][4]. Botnets with thousands and millions of nodes have been observed in the wild, with newer ones being observe every day [10]. The lifecycle of a Botnet is depicted in Fig.1. The initial enlisting occurs by exploiting a known vulnerability in the Operating systems running on the machines using a wide variety of Fig 1. Botnet in action mechanisms like worms, Trojans, P2P file sharing networks, and exploits of common Windows II. DATA vulnerabilities, etc. Once compromised, the bot is programmed to connect to a central location We had two separate data sets to collect for the (typically an IRC [11] server), where the Botmaster purpose of our experiments. The first set included a could login and issue commands to the logged in large number of binaries/executables labeled as bots. This mechanism essentially means that the botnet binaries and benign binaries. We acquired communication is free, as broadcast is taken care of the botnet binaries from the computer science by the IRC channel. Most Bots additionally ship department at John Hopkins University. This is the with multiple IRC and DNS addresses, meaning same data they used for botnet related work [18]. taking down one such IRC channel does not in As far as the benign binaries are concerned we general impair the activities of the Botnet. randomly picked them up from Unix and windows Originally, most techniques to thwart machines, namely from /usr/bin and from windows Botnets have been reactive, reducing their system32 directory. This data was comprehensive effectiveness significantly, e.g. using Honeypots to and well represented. trap and study malware, etc. Of late, significant We also needed labeled IRC logs for our research has been made into the dynamics of a experiments. While benign IRC logs are easily Botnet, and more proactive techniques have been available on the web, botnet affected IRC logs are suggested [5] [6] [7] [8]. A wide variety of features not readily available because of privacy and legal and watermarks for network activity are employed issues. The benign IRC logs were collected from to predict Botnet activity, including TCP syn several publicly available IRC logs from IRC scanning, DNS monitoring, and extensive models channels like Wikipedia, linode, and of Botnet attack and propagation [9]. Despite all kernelnewbies. These represent a diverse collection of logs, with different purposes. Some of these logs
  • 2. also have automated bots for channel maintenance We used supervised learning techniques on groups operations. of benign executables vs. Botnet executables. Obtaining malicious IRC logs proved to be extremely hard. There are several independent 3.1.1 Features Security forums who actively track botnets [12]. They monitor and mine Botnet IRC logs for We are focusing on binary profiling, and hex analysis and mitigation. Due to privacy and security dumps for feature extraction. Identifying the strain issues, we were unable to obtain this data, which of these binaries might also give an insight about clearly represents a rich set of real world Botnet the location of the Command-and-Control center IRC logs, and which would have definitely and about the botnet capabilities in general. We provided more qualitative as well as quantitative used n-grams (more specifically 4-grams) of results. Several other potential sources setup their hexdump of the binaries as our features. For own private infrastructure for collecting such example if the hexdump is ff 00 12 1a 32, our training data [5]. features will be ff 00 12 1a and 0012 1a 32. We Nevertheless we acquired data from extracted around more than a million features. We Northwestern University where the department of then used chi-square to select around 10,000 most CS is conducting research on wireless overlay informative features. For each feature, we find its network architectures and botnet detection [7]. The value as follows: data regarding botnet IRC logs was not comprehensive in the sense that it was IRC traffic feature=00121a32 feature≠00121a32 over a small amount of time. A larger and more class=botnet A C comprehensive dataset could have established our class≠botnet B D results and hypothesis more conclusively. N ( AD − CB) 2 III. APPROACHES χ 2 (botnet, f ) = ( A + C )(B + D)( A + B)(C + D) . We tried a 2 stage approach to solve this issue. Now we select the top 10,000 features with the These methods are complementary and we can highest chi-square scores. combine them for better results. They are as follows: 3.1.2 Classification 3.1 Botnet Binary Detection We used several classification algorithms to classify the binaries into malicious/benign. The There are several stages in the lifecycle of a Botnet models we used are as follows: where a Machine learning based solution can be 1. Multinomial Naïve Bayes deployed to thwart its effectiveness. During an 2. linear SVM early stage, a Binary detection and classification 3. kNN system can warn of a potentially malicious This makes use of similarity metrics (e.g. executable which might endanger a host machine. cosine similarity) to find k nearest neighbors There has been some work already in this area [7] and classifies the point under consideration and we leverage on top of their work to classify into the majority class of its k nearest neighbor Botnet executables which propagate as worms on [15]. We used k=5. networks scanning for vulnerable machines for 4. Logistic Regression infecting them, and enlisting them into the Bot 5. Multiboost Adaboost - Class for boosting a network. classifier using the MultiBoosting Unlike Virus scanners like Norton AV, or method. MultiBoosting is an extension to the McAfee, a Machine learning solution can perform highly successful AdaBoost technique for this classification without the need for explicit forming decision committees. MultiBoosting signatures. Identifying such binaries without can be viewed as combining AdaBoost with explicit signatures needs recognizing common wagging. It is able to harness both AdaBoost's features and correlations between these binaries, high bias and variance reduction with e.g. a Botnet executable will be a self-propagating wagging's superior variance reduction. Using worm with a simple IRC client built in. Presence or C4.5 as the base learning algorithm, Multi- absence of such a feature is an indicator that such a boosting is demonstrated to produce decision binary might potentially be a Botnet executable. committees with lower error than either
  • 3. AdaBoost or wagging significantly more often attributes not yet considered in the path from than the reverse over a large representative the root. cross-section of UCI data sets. It offers the further advantage over AdaBoost of suiting For the purpose of our experiments we did not parallel execution. See [17] for more details. make use of any separate testing data set. To stay unbiased we used 10 fold cross validation to get the 6. J48 Decision tree -This is an entropy based results. We used off the shelf softwares such as approach for generating a pruned or unpruned weak[19] and libsvm[13] for experimental results. C4.5 decision tree. For more information see [16]. In general, if we are given a probability 3.2 IRC log based detection distribution P = (p1, p2, .., pn) then the Information conveyed by this distribution, also IRC has played a central role in the called the Entropy of P, is: simplicity and effectiveness of a Botnet. Using a public communication channel, the Botmaster can I ( P) = −( p1 * log( p1 ) + p2 * log( p2 ) + ... + pn * log( pn )) use a simple command interface to communicate with a huge number of compromised zombie nodes, instructing them to carry out his orders. If a set T of records is partitioned into disjoint There are two phases to this approach: exhaustive classes C1, C2, .., Ck on the basis First, to separate IRC traffic from other traffic. This of the value of the categorical attribute, then is a reasonably solved problem [5][6]. The second the information needed to identify the class of step comprises of identifying Botnet traffic in the an element of T is Info(T) = I(P), where P is IRC traffic. Hence the problem now boils down to the probability distribution of the partition (C1, a text classification problem. C2, .., Ck): To be able to differentiate a benign IRC log from an IRC log manifested with Botnet activity, we used features involving both dialogues and IRC commands. Then using these features, we | C1 | | C 2 | | Ck | experimented with a variety of machine learning P=( , ,..., ) algorithms on them. In particular, we ran |T | |T | |T | algorithms such as Naïve Bayes, SVM, J48 decision trees, kNN, etc. with 10 fold cross If we first partition T on the basis of the value validation. The main categories of features we used of a non-categorical attribute X into sets T1, T2, included: .., Tn then the information needed to identify the class of an element of T becomes the Number of users: An IRC channel with Botnet weighted average of the information needed to activity should contain an unusually large identify the class of an element of Ti, i.e. the number of users. weighted average of Info(Ti): Mean / variance of words per line and characters per word in dialogues: Bots usually n | Ti | do not produce dialogues that resemble human Info( X , T ) =∑ i =1 | T | * Info(Ti ) dialogue. Number and frequency of IRC commands: We have noticed through examination of the logs Consider the quantity Gain(X,T) defined as that there tends to be a large number of IRC commands at small intervals in Botnet Gain( X , T ) = Info(T ) − Info( X , T ) manifested logs. One possible explanation would be the immense number of joins and This represents the difference between the exits from the result of accommodating a huge information needed to identify an element of T number of users in one channel. and the information needed to identify an Number of lines, dialogues, and commands: In element of T after the value of attribute X has a benign IRC log, the number of dialogues been obtained, that is, this is the gain in should be much greater than the number of information due to attribute X. We can use this commands. And as mentioned above, a Botnet notion of gain to rank attributes and to build manifested log tends to contain an immense decision trees where at each node is located the number of IRC commands. attribute with greatest gain among the
  • 4. IV. RESULTS AND EVALUATION Model Accuracy F1 Kappa score score 4.1 Botnet Binary Detection NB .70 .457 .705 The results obtained from the botnet binary based detection approach are summarized in Fig. 2. Linear SVM .976 .945 .97 Clearly all the models performed reasonably well. Special mention must be made about Naïve Bayes which performed remarkably well although it is J48 – decision .992 .982 .99 one of the simplest of models. SVM performed tree good too. However some models like kNN gave an accuracy of 96.4 which was lower than that of kNN .992 .982 .991 others. We will discuss about these results in details in the discussion section. In particular our evaluation metric included accuracy, F1 score and Logistic .989 .974 .987 kappa measure. MultiboostAB .967 .926 .963 # Correctly predicted data po int s Accuracy = Fig. 3 Performance of IRC based detection # Total data po int s 4.2 IRC log based detection 2 * Pr ecision * Re call F1 = Pr ecision + Re call The results obtained from the IRC log based detection approach are summarized in Fig. 3. A large chunk of the problem here involves text Observed Agreement − Chance Agreement classification, and all algorithms, barring Naïve Kappa = Bayes, perform encouragingly well. Total Observed − Chance Agreement V. DISCUSSION In the binary based classification approach, most of the linear classifiers such as NB and linear SVM performed remarkably well. kNN however Model Accuracy F1 Kappa performed relatively bad. This can be explained on score score the basis of linearity of the data set. kNN is a non- .982 .981 .963 linear model and hence it has less bias and a high NB variance unlike NB which is linear and has high bias and less variance. Given our data set was Linear SVM .982 .981 .963 linear, it was therefore no surprise that kNN exhibited a higher generalization error compared to J48 – decision .976 .975 .95 NB or linear SVM. tree In the IRC log based approach, most classifiers performed similarly, except for Naïve Bayes, which was significantly worse. One kNN .964 .963 .926 possible explanation would be that there are dependencies in the features. These dependencies Logistic .982 .9815 .963 may be resolved with more training data. Since the set of IRC logs we obtained was not very large and comprehensive, it is possible that currently MultiboostAB .976 .975 .951 correlated features are not actually correlated in a larger dataset. Fig. 2 Performance of binary based detection The other classifiers performed very well, all having accuracies greater than .96. Since SVM
  • 5. and logistic regression had the best accuracies, F1 JHU for botnet binary datasets. We further scores, and Kappa scores, it is likely that our acknowledge our gratitude to Prof. Andrew Ng and dataset is linear. the TAs of cs229 for their help and support. In the logs that we collected with Botnet activity, there weren’t strong attempts of log VIII. REFERENCES obfuscation by Botmasters. But suppose that there were strong attempts to produce human-like [1] http://en.wikipedia.org/wiki/Botnet dialogues by the bots. Then to potentially fool our [2] http://en.wikipedia.org/wiki/Zombie_computer classifiers, the dialogues must produce similar [3]http://securitywatch.eweek.com/exploits_and_at averages and variances in terms of words per line tacks/everydns_opendns_under_botnet_ddos_attac and characters per word. k.html Pushing the above scenario to the extreme, [4]http://it.slashdot.org/article.pl?sid=06/10/17/002 suppose that the bots produce perfect dialogues. In 251 this case, our classifier should still work due to one [5] W. Timothy Strayer, Robert Walsh, Carl fundamental difference between benign logs and Livadas, and David Lapsley. Detecting Botnets logs manifested with Botnet activity – there is an with Tight Command and Control. To Appear in immense number of join and exit commands in a 31st IEEE Conference on Local Computer Botnet infested channel. Networks (LCN'06). November 2006 [6] Using Machine Learning Techniques to Identify Botnet Traffic, Carl Livadas, Robert Walsh, David VI. CONCLUSION Lapsley, W. Timothy Strayer [7] Yan Chen, “Towards wireless overlay network In this course project we aimed to architectures” highlight a major problem plaguing the Internet [8] Binkley- An Algorithm for Anomaly-based today and a potential solution to the problem by Botnet Detection . employing novel Machine Learning techniques. [9] E. Cooke, F. Jahanian, and D. McPherson. The Our results are encouraging, indicating a definite zombie roundup: Understanding, detecting, and advantage of using such techniques in the wild for disrupting botnets. In USENIX SRUTI Workshop, alleviating the problem of Botnets. pages 39–44, 2005. Conversion of the problem in the Text [10]http://www.vnunet.com/vnunet/news/2144375/ classification domain has opened up new avenues botnet-operation-ruled-million in feature extraction for our solution. Possible [11] http://www.irc.org/tech_docs/rfc1459.html future enhancements include adding additional [12] www.shadowserver.org features to the IRC traffic classification engine. [13] http://www.csie.ntu.edu.tw/~cjlin/libsvm/ Also to solve the problem of training data which, [14] Detecting Botnets with Tight Command and in retrospect, seems to be the critical to the success Control, Carl Livadas, Robert Walsh, David of such a project, we propose setting up a farm of Lapsley, W. Timothy Strayer. Virtual machines as a testbed for collecting and [15] Aha, D., and D. Kibler (1991) "Instance- mining our own logs for training purpose. based learning algorithms", Machine Learning, Lastly, the end goal of such a tool should vol.6, pp. 37-66. be to provide proactive help to system [16] Ross Quinlan (1993). "C4.5: Programs for administrators. In this regard, it is easy to envisage Machine Learning", Morgan Kaufmann Publishers, an application which can run on a Gateway or a San Mateo, CA. Router, and look at Network flow traffic, and [17] Geoffrey I. Webb (2000). "MultiBoosting: A classify it on-the-fly enabling automated Technique for Combining Boosting and Wagging". blacklisting of involved machines in the network. Machine Learning, 40(2): 159-196, Kluwer Academic Publishers, Boston [18] Moheeb Abu Rajab et.al (2006)”A Multifaceted approach to understanding the botnet VII. ACKNOWLEDGEMENT phenonmenon” [19] http://www.cs.waikato.ac.nz/ml/weka/ We take this opportunity to thank Elizabeth Stinson of Stanford Security Group for her help and support. We also extend our acknowledgements to Yan Chen of northwestern university for providing us with botnet IRC logs and to Andreas Terzis of