Research Project - Master's in Data Analytics
This thesis project applies statistical and machine learning techniques learned as part of the Data Analytics coursework to the problem of malicious web page detection.
1. Detecting Malicious Web Pages Using an Ensemble Weighted Average Model
- Research Project Presentation
Dharmendra Lalji Vishwakarma
X18108181
MSc in Data Analytics – Cohort A
September 2018-19
2. Area of Study & Motivation
- Increase in internet users
- Popularity of cyber crimes
- Websites as a medium of attack
Cyber-criminal activities such as ransomware, botnets, information stealing, and DDoS attacks
- Lead to loss of information privacy
- Cause losses to businesses
3. Present Solutions
1. Education & Legislation
2. Hand-Crafted Techniques
1. Static technique – blacklisting & whitelisting approach
2. Dynamic technique – useful for creating blacklists
3. Intelligent Machine Learning Models – using features present in the malicious web page
1. Recent case study – keyword-density approach (Altay et al., 2018)
4. Research Question
How can a weighted average ensemble of the keyword-density, URL and JavaScript feature sets offer substantial improvements over a keyword-density predictor in identifying malicious web pages?
5. Research Objectives
• Analysing important attributes, such as URL length among the URL characteristics, for their power to distinguish the malicious class.
• Reproducing the keyword-density method of classifying web pages; it acts as the baseline model against which the improved classifier is compared on the same dataset.
• Experimenting with each independent feature set against the outcome to measure its contribution to the prediction.
• Dynamically calculating the weights for each feature set for classification using a weighted ensemble approach.
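The last objective, dynamically weighting each feature set, can be sketched as a weighted average over per-model probabilities. This is an illustrative scheme only (weights proportional to hypothetical validation scores), not the project's exact weighting rule:

```python
import numpy as np

def ensemble_weights(val_scores):
    """Turn per-model validation scores into normalised weights.
    (Illustrative scheme: weights proportional to score.)"""
    scores = np.asarray(val_scores, dtype=float)
    return scores / scores.sum()

def weighted_average_predict(probas, weights, threshold=0.5):
    """Combine per-model malicious-class probabilities into one decision.
    probas: shape (n_models, n_samples); weights: shape (n_models,)."""
    combined = np.average(probas, axis=0, weights=weights)
    return (combined >= threshold).astype(int), combined

# Hypothetical validation scores for keyword-density, URL and JavaScript models
w = ensemble_weights([0.80, 0.90, 0.70])

# Hypothetical per-model malicious probabilities for three unseen pages
p = np.array([[0.9, 0.2, 0.6],
              [0.8, 0.1, 0.7],
              [0.4, 0.3, 0.5]])
labels, combined = weighted_average_predict(p, w)
print(w.round(3), labels)  # weights sum to 1; labels are 0/1 per page
```

A stronger model thus pulls the combined probability towards its own prediction, which is the intuition behind the weighted average ensemble.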
6. Literature Review
• Detection of malicious websites using URL features
• (Chakraborty and Lin, 2017) and (Kim et al., 2018)
• Malicious websites detection using JavaScript codes
• (Liu et al., 2018) and (Stokes et al., 2018)
• Using machine learning with a content-based approach
• (Altay et al., 2018) and (Saxe et al., 2018)
• Using Hybrid features approach
• (Akiyama et al., 2017) and (Kazemian and Ahmed, 2015)
• Review of Ensemble learning
• (Nagaraj et al., 2018) and (Anne Ubing et al., 2019)
11. Features Extraction - HTML
• Sklearn pipeline – TF-IDF Vectoriser module
• Takes care of text processing such as tokenisation, stop-word removal, stemming & n-grams.
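The pipeline described on this slide could look roughly as follows. The toy page texts and the LogisticRegression classifier are placeholders, and note that TfidfVectorizer handles tokenisation, stop words and n-grams out of the box, but stemming would require a custom tokenizer (e.g. from NLTK), not shown here:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy page texts and labels (1 = malicious), purely for illustration
pages = ["verify your account password urgent",
         "official university research portal",
         "click here to claim your free prize now",
         "latest news and weather updates"]
labels = [1, 0, 1, 0]

# Pipeline: text processing via TF-IDF, then a simple classifier
clf = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english", ngram_range=(1, 2))),
    ("model", LogisticRegression()),
])
clf.fit(pages, labels)
print(clf.predict(["claim your prize account"]))
```

The pipeline keeps vectorisation and classification as one object, so the same text processing is applied consistently at training and prediction time.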
19. Discussion
• URL-based models proved to be the best classifiers.
• Dataset differences (2019)
• Data extraction differences (tools, legal policies & techniques)
20. Future Work
• Browser plugins
• More features can be added, such as DNS and server relationships.
• Combination of static & dynamic techniques.
• Predicting broader categories of classes, e.g. threat types.
21. References
• Altay, B., Dokeroglu, T. and Cosar, A. (2018). Context-sensitive and keyword density-based supervised machine
learning techniques for malicious webpage detection, Soft Computing.
• Chakraborty, G. and Lin, T. T. (2017). A url address aware classification of malicious websites for online security
during web-surfing, 2017 IEEE International Conference on Advanced Networks and Telecommunications
Systems (ANTS), pp. 1-6.
• Kim, S., Kim, J., Nam, S. and Kim, D. (2018). WebMon: ML- and YARA-based malicious webpage detection, Computer Networks 137: 119-131.
• Liu, J., Xu, M., Wang, X., Shen, S. and Li, M. (2018). A markov detection tree-based centralized scheme to
automatically identify malicious webpages on cloud platforms, IEEE Access 6: 74025-74038.
• Messabi, K. A., Aldwairi, M., Yousif, A. A., Thoban, A. and Belqasmi, F. (2018). Malware detection using dns
records and domain name features, Proceedings of the 2Nd International Conference on Future Networks and
Distributed Systems, ICFNDS '18, ACM, New York, NY, USA, pp. 29:1-29:7.
• Saxe, J., Harang, R. E., Wild, C. and Sanders, H. (2018). A deep learning approach to fast, format-agnostic detection of malicious web content, CoRR abs/1804.05020.
• Seifert, C., Welch, I., Komisarczuk, P., Aval, C. U. and Endicott-Popovsky, B. (2008). Identification of malicious
web pages through analysis of underlying dns and web server relationships, 2008 33rd IEEE Conference on
Local Computer Networks (LCN), pp. 935-941.
• Stokes, J. W., Agrawal, R. and McDonald, G. (2018). Neural classification of malicious scripts: A study with
javascript and vbscript, CoRR abs/1805.05603.
• Wirth, R. (2000). Crisp-dm: Towards a standard process model for data mining, Proceedings of the Fourth
International Conference on the Practical Application of Knowledge Discovery and Data Mining, pp. 29-39.
Hello everyone! My name is Dharmendra Vishwakarma. This is a presentation of the Research Project for the Master's in Data Analytics course. The research topic is "Detecting malicious web pages using an ensemble weighted average model".
My area of study is a mix of the cyber security and data analytics domains.
1. With advancements in communication technologies and an ever-growing internet user base, most services are online nowadays, such as e-banking, social networking, e-commerce and entertainment. Due to the easy availability of services and information, users tend to browse the internet freely without knowing its negative side. These services are exploited by cyber attackers to steal useful and private user-sensitive information.
2. Cyber attackers use websites as a medium to redirect users into their malicious network for further attacks, or use drive-by-download software to install malware locally on the user's computer. This enables attackers to perform other cyber-criminal activities such as ransomware, botnets, information stealing and DDoS attacks. These lead to loss of information privacy and, in many cases, losses to businesses.
To solve this problem, there are primarily three categories of solutions.
Firstly, users are given knowledge about prevention techniques in the form of education and legislation through government initiatives to discourage such activities. However, due to the busy nature of business, people often make mistakes in real-world scenarios.
The second approach consists of computerised hand-crafted techniques to prevent phishing activities. It usually involves static techniques such as blacklisting and whitelisting approaches.
A dynamic approach uses a virtual sandbox environment to observe the behaviour of web pages in order to detect their deceptive nature. This method is not ideal for real-time detection but can be employed for creating a blacklist of URLs.
Lastly, intelligent machine learning models are used to solve this problem using features present in the website. A recent study using a keyword-density-based approach for detecting malicious websites has shown significant accuracy. However, the content present on the page alone cannot be a significant factor contributing towards the deceptive nature of the website, given the varying nature of attacks.
So, the research question for my proposal is “”.
And the specific objectives of this research are “”.
In this research proposal, various other important factors are considered along with the content-based approach. These factors are URL-based features, DNS information, server details and JavaScript code present on the page. These factors can contribute to the final decision, as the URL alone cannot efficiently detect the phishing behaviour of a website.
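As an illustration of the URL-based feature set, a handful of lexical features can be computed with the standard library alone. The exact feature list here is a hypothetical subset for illustration, not the project's actual feature set:

```python
from urllib.parse import urlparse

def url_features(url):
    """A small set of lexical URL features of the kind used to
    flag suspicious links (illustrative subset only)."""
    parsed = urlparse(url)
    host = parsed.netloc
    return {
        "url_length": len(url),                       # long URLs are often suspicious
        "num_dots": host.count("."),                  # many subdomains can hide the real domain
        "num_hyphens": host.count("-"),               # hyphenated look-alike hosts
        "has_at_symbol": "@" in url,                  # '@' can obscure the true destination
        "num_digits": sum(c.isdigit() for c in url),
        "path_depth": parsed.path.count("/"),
    }

print(url_features("http://secure-login.example.com.evil.tld/update/account"))
```

Features like these are cheap to compute at browse time, which is why URL-based models suit real-time detection.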
The main contribution of this research will be using ensemble learning to decide the final classification result from the individual models.
The literature review suggests the following trends.
Many authors have considered different features from malicious websites, such as URL, DNS, JavaScript and page contents.
All these previous studies considered different aspects of malicious threats to develop solutions. However, there is a need to develop a hybrid set of solutions that can detect malicious content even if one feature set fails to detect it. For instance, web threats can appear in many forms within a page, such as XSS, phishing or a DDoS attack. The idea is to consider a weighted impact on the final decision.
The research methodology is based on the CRISP-DM which is a successful methodology for data mining projects.
Therefore, each task of the research is divided into six phases as per the CRISP-DM paradigm.
The dataset for this research is as follows -
100 thousand benign URLs will be extracted from Alexa and 20 thousand malicious URLs will be downloaded from PhishTank.
Both datasets have been previously used in the literature. Since a comparison will be made against the baseline model, the same dataset is considered.
Box-plot for outlier detection
- URL length shows outliers, further explored by class.
- The data is not normally distributed.
- Most of the data is right-skewed.
- Correlated attributes are detected; for example, cookies_ref_count is related to setinterval time.
- The rest seem fine and equally important for model building.
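The checks above (right skew, box-plot/IQR outliers, correlated attributes) can be reproduced on synthetic data. The generated columns only mimic the behaviour noted in the EDA and are not the real extracted features:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic stand-ins: a right-skewed url_length and a correlated pair
df = pd.DataFrame({
    "url_length": rng.lognormal(mean=3.5, sigma=0.6, size=500),
    "cookies_ref_count": rng.poisson(4, size=500),
})
df["setinterval_time"] = df["cookies_ref_count"] * 10 + rng.normal(0, 2, 500)

# Right skew: positive sample skewness
print("right-skewed:", df["url_length"].skew() > 0)

# IQR rule, the same criterion a box-plot whisker uses for outliers
q1, q3 = df["url_length"].quantile([0.25, 0.75])
outliers = df[df["url_length"] > q3 + 1.5 * (q3 - q1)]
print("outliers flagged:", len(outliers))

# Correlated pair shows up clearly in the correlation matrix
print("corr:", round(df.corr().loc["cookies_ref_count", "setinterval_time"], 2))
```

Flagging such correlated pairs matters because redundant features can distort feature-importance readings in the downstream models.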
The implementation is as follows.
Web pages from the dataset are extracted and stored along with the URLs. The features related to keyword density, URL, JavaScript code and DNS server relationships are extracted using the feature extraction process. These features, with the class variable, are supplied to the individual machine learning models. Their outcomes are given as input to the weighted ensemble model. In this way dynamic weights are determined and a trained model is generated. The entire process is split into training and prediction. During prediction, unseen web pages are evaluated on the predictive model. The evaluation is conducted using precision, recall, F1-score, area under the ROC curve and 10-fold cross-validation. Furthermore, a statistical test is carried out to check the significance of the model.
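The evaluation step can be sketched with scikit-learn's metric and cross-validation utilities. The synthetic dataset and RandomForestClassifier below are stand-ins for the real extracted feature matrix and the project's models:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in: imbalanced classes mimic the benign/malicious split
X, y = make_classification(n_samples=600, n_features=12, weights=[0.8, 0.2],
                           random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

model = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)
pred = model.predict(X_te)
proba = model.predict_proba(X_te)[:, 1]

print("precision:", round(precision_score(y_te, pred), 3))
print("recall:   ", round(recall_score(y_te, pred), 3))
print("f1:       ", round(f1_score(y_te, pred), 3))
print("roc_auc:  ", round(roc_auc_score(y_te, proba), 3))

# 10-fold cross-validation, as described in the evaluation plan
cv_f1 = cross_val_score(model, X, y, cv=10, scoring="f1")
print("10-fold mean F1:", round(cv_f1.mean(), 3))
```

Reporting precision, recall and F1 alongside ROC AUC matters here because the classes are imbalanced, so accuracy alone would overstate performance.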
The ensemble technique has the lowest error among the individual models.
These are the references used in the presentation.