Detecting Malicious Web Pages Using an Ensemble Weighted Average Model
- Research Project Presentation
Dharmendra Lalji Vishwakarma
X18108181
MSc in Data Analytics, Cohort A
September 2018-19
Area of Study & Motivation
Increase in internet users
- Popularity of cyber crimes
- Websites as a medium of attack
Cyber-criminal activities such as ransomware, botnets, information stealing and DDoS
- Lead to loss of information privacy
- Cause losses to businesses
Present Solutions
1. Education & legislation
2. Hand-crafted techniques
   1. Static technique: blacklisting and whitelisting approach.
   2. Dynamic technique: useful for creating blacklists.
3. Intelligent machine learning models: using features present in the malicious webpage.
   1. Recent case study: keyword-density approach (Altay et al., 2018)
Research Question
How can a weighted average ensemble over the feature sets of keyword density, URL features and JavaScript code offer substantial improvements over the keyword-density predictor in identifying malicious web pages?
Research Objectives
• Analysing important URL attributes, such as URL length, for distinguishing the malicious class.
• Reproducing the keyword-density method of classifying webpages. It acts as the baseline against which an improved classifier is compared on a similar dataset.
• Experimenting with each feature set independently against the outcome to see its contribution to the prediction.
• Dynamically calculating the weight of each feature set for classification using a weighted ensemble approach.
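The last objective, calculating the weights dynamically, could be approached in several ways. The following is a minimal sketch assuming a plain grid search over small integer weights, scored by accuracy on a validation set. The function names, the candidate weights and the validation data are hypothetical and are not taken from the slides.

```python
from itertools import product


def weighted_vote(probs, weights, threshold=0.5):
    """Weighted average of per-model malicious probabilities, then a threshold."""
    score = sum(p * w for p, w in zip(probs, weights)) / sum(weights)
    return int(score >= threshold)


def accuracy(weights, rows, labels):
    """Share of validation pages the weighted vote labels correctly."""
    hits = sum(weighted_vote(r, weights) == y for r, y in zip(rows, labels))
    return hits / len(labels)


def search_weights(rows, labels, candidates=(1, 2, 3)):
    """Try every combination of candidate weights and keep the most accurate."""
    best_weights, best_acc = None, -1.0
    for weights in product(candidates, repeat=len(rows[0])):
        acc = accuracy(weights, rows, labels)
        if acc > best_acc:
            best_weights, best_acc = weights, acc
    return best_weights, best_acc


# Hypothetical validation data: malicious probabilities in (URL, JS, KW) order.
validation_rows = [
    (0.9, 0.4, 0.3),
    (0.8, 0.2, 0.4),
    (0.2, 0.6, 0.4),
    (0.1, 0.3, 0.7),
]
validation_labels = [1, 1, 0, 0]

best_weights, best_accuracy = search_weights(validation_rows, validation_labels)
```

An exhaustive search like this is only practical for a handful of models and candidate values; with more feature sets a coarser or randomised search would likely be needed.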
Literature Review
• Detection of malicious websites using URL features
• (Chakraborty and Lin, 2017) and (Kim et al., 2018)
• Malicious website detection using JavaScript code
• (Liu et al., 2018) and (Stokes et al., 2018)
• Using machine learning with a content-based approach
• (Altay et al., 2018) and (Saxe et al., 2018)
• Using Hybrid features approach
• (Akiyama et al., 2017) and (Kazemian and Ahmed, 2015)
• Review of Ensemble learning
• (Nagaraj et al., 2018) and (Anne Ubing et al., 2019)
Research Methodology
• CRISP-DM Methodology
(Wirth, 2000)
Data Set Description
• Sources:
• Alexa – Benign Websites
• PhishTank – Malicious Websites
Features Extraction - URL
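The content of this slide is not preserved in the transcript. As a hedged illustration of the kind of lexical URL attributes the objectives mention (URL length is the only one named), the sketch below derives a few simple features with the standard library. Every feature other than URL length, and the example URL itself, is an assumption chosen for illustration.

```python
from urllib.parse import urlparse


def url_features(url):
    """Return a few simple lexical features of a URL (illustrative only)."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    return {
        "url_length": len(url),                       # named in the objectives
        "host_length": len(host),                     # assumed feature
        "num_dots": host.count("."),                  # assumed feature
        "num_digits": sum(ch.isdigit() for ch in url),
        "has_at_symbol": "@" in url,
        "uses_https": parsed.scheme == "https",
        "path_depth": len([seg for seg in parsed.path.split("/") if seg]),
    }


# Hypothetical URL, used only to exercise the function.
features = url_features("http://login.example.com/account/verify?id=123")
```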
Features Extraction - JavaScript
Features Extraction - HTML
• Sklearn pipeline with the TfidfVectorizer module
• Takes care of text processing such as tokenisation, stop-word removal, stemming & n-grams.
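A keyword-based pipeline of this kind might look roughly like the sketch below. It assumes scikit-learn is installed and uses `LogisticRegression` as a stand-in classifier, since the slides do not name the model. The page texts and labels are toy data. Note that `TfidfVectorizer` handles tokenisation, stop-word removal and n-grams through its own parameters, but stemming is not built in and would need a custom tokeniser, so it is left out here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy page texts and labels (1 = malicious, 0 = benign); purely illustrative.
pages = [
    "verify your account password now",
    "update your bank login details now",
    "weather forecast for the weekend",
    "recipes for a quick family dinner",
]
labels = [1, 1, 0, 0]

keyword_model = Pipeline([
    # Lowercasing, English stop-word removal, unigrams and bigrams.
    ("tfidf", TfidfVectorizer(lowercase=True, stop_words="english", ngram_range=(1, 2))),
    # Stand-in classifier; the original model is not stated in the slides.
    ("clf", LogisticRegression()),
])
keyword_model.fit(pages, labels)

vocabulary = keyword_model.named_steps["tfidf"].vocabulary_
probabilities = keyword_model.predict_proba(["please verify your bank password"])
```

`probabilities` holds one row per input page with a column per class, which is the form a weighted average ensemble would consume.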
Final Generated Data Set
After data cleansing (duplicates & null values)
EDA-1
EDA-2
EDA-3
Implementation
Results
Final Ensemble Results
Optimal weights: (2, 3, 2) for (URL, JS, KW)
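To make the role of these weights concrete, here is a small sketch of how a weighted average could combine three per-model outputs. It assumes each model emits a probability that the page is malicious and that a 0.5 threshold decides the class; both are assumptions, as the slides do not spell out the combination rule. The probabilities are hypothetical.

```python
def weighted_average(probabilities, weights):
    """Combine per-model malicious probabilities into one weighted score."""
    if len(probabilities) != len(weights):
        raise ValueError("need exactly one weight per model")
    return sum(p * w for p, w in zip(probabilities, weights)) / sum(weights)


# Weights reported on the slide, in (URL, JS, KW) order.
WEIGHTS = (2, 3, 2)

# Hypothetical model outputs for a single page, same order as the weights.
page_probabilities = (0.9, 0.3, 0.6)

score = weighted_average(page_probabilities, WEIGHTS)
label = int(score >= 0.5)  # assumed decision threshold
```

With these numbers the score is (2 × 0.9 + 3 × 0.3 + 2 × 0.6) / 7 = 3.9 / 7, roughly 0.557, so the page would be labelled malicious even though the most heavily weighted model (JS) voted benign.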
Comparison Results
McNemar's test on contingency table
- Showed a statistically significant difference between the developed models.
- α = 0.05, p < 0.05
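For readers unfamiliar with the test, the sketch below computes an exact two-sided McNemar p-value from the two discordant cells of a 2×2 contingency table. The slides do not say whether the exact or the chi-squared variant was used, so the exact form is an assumption, and the counts are hypothetical rather than the project's results.

```python
from math import comb


def mcnemar_exact(b, c):
    """Exact two-sided McNemar p-value from the discordant counts b and c.

    b: pages the first model classified correctly and the second wrongly.
    c: pages the first model classified wrongly and the second correctly.
    """
    n = b + c
    if n == 0:
        return 1.0  # no disagreements, so no evidence of a difference
    k = min(b, c)
    # Binomial(n, 0.5) probability of a split at least this lopsided, one tail.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical discordant counts for two classifiers on the same test pages.
b, c = 5, 20
p_value = mcnemar_exact(b, c)
significant = p_value < 0.05  # alpha as stated on the slide
```

Only the pages on which the two models disagree enter the calculation; pages both models get right or both get wrong carry no information about which model is better.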
Discussion
• URL-based models proved to be the best classifiers.
• Dataset difference (2019)
• Data extraction differences (Tools, Legal policies & Techniques)
Future Work
• Browser plugins
• More features can be added, such as DNS and server relationships.
• Combination of static & dynamic techniques.
• Predicting broader categories of classes, e.g. threat types.
References
• Altay, B., Dokeroglu, T. and Cosar, A. (2018). Context-sensitive and keyword density-based supervised machine learning techniques for malicious webpage detection, Soft Computing.
• Chakraborty, G. and Lin, T. T. (2017). A URL address aware classification of malicious websites for online security during web-surfing, 2017 IEEE International Conference on Advanced Networks and Telecommunications Systems (ANTS), pp. 1-6.
• Kim, S., Kim, J., Nam, S. and Kim, D. (2018). WebMon: ML- and YARA-based malicious webpage detection, Computer Networks 137: 119-131.
• Liu, J., Xu, M., Wang, X., Shen, S. and Li, M. (2018). A Markov detection tree-based centralized scheme to automatically identify malicious webpages on cloud platforms, IEEE Access 6: 74025-74038.
• Messabi, K. A., Aldwairi, M., Yousif, A. A., Thoban, A. and Belqasmi, F. (2018). Malware detection using DNS records and domain name features, Proceedings of the 2nd International Conference on Future Networks and Distributed Systems, ICFNDS '18, ACM, New York, NY, USA, pp. 29:1-29:7.
• Saxe, J., Harang, R. E., Wild, C. and Sanders, H. (2018). A deep learning approach to fast, format-agnostic detection of malicious web content, CoRR abs/1804.05020.
• Seifert, C., Welch, I., Komisarczuk, P., Aval, C. U. and Endicott-Popovsky, B. (2008). Identification of malicious web pages through analysis of underlying DNS and web server relationships, 2008 33rd IEEE Conference on Local Computer Networks (LCN), pp. 935-941.
• Stokes, J. W., Agrawal, R. and McDonald, G. (2018). Neural classification of malicious scripts: A study with JavaScript and VBScript, CoRR abs/1805.05603.
• Wirth, R. (2000). CRISP-DM: Towards a standard process model for data mining, Proceedings of the Fourth International Conference on the Practical Application of Knowledge Discovery and Data Mining, pp. 29-39.
Thank You!

  • 1. Detecting MaliciousWeb Pages Using An EnsembleWeighted Average Model - Research Project Presentation Dharmendra Lalji Vishwakarma X18108181 MSc in DataAnalytics – CohortA September 2018-19
  • 2. Area of Study & Motivation Increase in internet Users - Popularity of Cyber Crimes - Websites as a medium of attack Cyber-criminal activities such as ransomware, botnet, information stealing, and DDOS etc. - Leads to loss of Information privacy - Loss to the businesses 1 2
  • 3. Present Solutions – 1. Education & Legislation 2. Hand Crafted Techniques 1. Static Technique - Black-listing & White-listing Approach. 2. Dynamic Technique – Useful for creating blacklists 3. Intelligent Machine learning models – Using features present in the malicious webpage. 1. Recent case study – Keyword-density approach (Altay et al., 2018) 3
  • 4. Research Question How can weighted average ensemble of features set of keyword-density, URL features and JavaScript Code offer substantial improvements to keyword- density predictor in identifying malicious web pages?
  • 5. ResearchObjectives • Analysing the important attributes such as URL length for URL characteristics in distinguishing malicious class. • Reproducing the keyword-density methods of classifying webpages. It acts as a baseline model over an improved version of classification for the similar dataset. • Experimenting with each independent feature against the outcome to see their contribution in the prediction. • Dynamically calculating the weights for each feature set for classification using an ensemble weighted approach.
  • 6. Literature Review • Detection of malicious websites using URL features • (Chakraborty and Lin, 2017) and (Kim et al., 2018) • Detection of malicious websites using JavaScript code • (Liu et al., 2018) and (Stokes et al., 2018) • Using machine learning with a content-based approach • (Altay et al., 2018) and (Saxe et al., 2018) • Using a hybrid features approach • (Akiyama et al., 2017) and (Kazemian and Ahmed, 2015) • Review of ensemble learning • (Nagaraj et al., 2018) and (Anne Ubing et al., 2019)
  • 7. Research Methodology • CRISP-DM Methodology (Wirth, 2000)
  • 8. Data Set Description • Sources: • Alexa – Benign Websites • PhishTank – Malicious Websites
  • 11. Feature Extraction - HTML • Sklearn pipeline - TF-IDF Vectoriser module • Takes care of text processing such as tokenisation, stop-word removal, stemming & n-grams.
  • 12. Final Generated Data Set • After data cleansing (duplicates & null values)
  • 13. EDA-1
  • 14. EDA-2
  • 15. EDA-3
  • 17. Results • Final ensemble results • Optimal weights (URL, JS, KW) = (2, 3, 2)
  • 18. Comparison Results • McNemar's test on the contingency table - Showed a statistically significant difference between the developed models. - α = 0.05, p < 0.05
  • 19. Discussion • URL-based models proved to be the best classifier. • Dataset difference (2019) • Data extraction differences (tools, legal policies & techniques)
  • 20. Future Work • Browser plugins • More features can be added, such as DNS and server relationships. • Combination of static & dynamic techniques. • Predicting broader categories of classes, e.g. threat types.
  • 21. References • Altay, B., Dokeroglu, T. and Cosar, A. (2018). Context-sensitive and keyword density-based supervised machine learning techniques for malicious webpage detection, Soft Computing. • Chakraborty, G. and Lin, T. T. (2017). A URL address aware classification of malicious websites for online security during web-surfing, 2017 IEEE International Conference on Advanced Networks and Telecommunications Systems (ANTS), pp. 1-6. • Kim, S., Kim, J., Nam, S. and Kim, D. (2018). WebMon: ML- and YARA-based malicious webpage detection, Computer Networks 137: 119-131. • Liu, J., Xu, M., Wang, X., Shen, S. and Li, M. (2018). A Markov detection tree-based centralized scheme to automatically identify malicious webpages on cloud platforms, IEEE Access 6: 74025-74038. • Messabi, K. A., Aldwairi, M., Yousif, A. A., Thoban, A. and Belqasmi, F. (2018). Malware detection using DNS records and domain name features, Proceedings of the 2nd International Conference on Future Networks and Distributed Systems, ICFNDS '18, ACM, New York, NY, USA, pp. 29:1-29:7. • Saxe, J., Harang, R. E., Wild, C. and Sanders, H. (2018). A deep learning approach to fast, format-agnostic detection of malicious web content, CoRR abs/1804.05020. • Seifert, C., Welch, I., Komisarczuk, P., Aval, C. U. and Endicott-Popovsky, B. (2008). Identification of malicious web pages through analysis of underlying DNS and web server relationships, 2008 33rd IEEE Conference on Local Computer Networks (LCN), pp. 935-941. • Stokes, J. W., Agrawal, R. and McDonald, G. (2018). Neural classification of malicious scripts: A study with JavaScript and VBScript, CoRR abs/1805.05603. • Wirth, R. (2000). CRISP-DM: Towards a standard process model for data mining, Proceedings of the Fourth International Conference on the Practical Application of Knowledge Discovery and Data Mining, pp. 29-39.
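
The weighted-average decision reported on the results slide, with weights (2, 3, 2) for the (URL, JS, KW) models, can be illustrated with a minimal sketch. The slides do not say whether the ensemble averages class probabilities or counts votes, so soft averaging of per-model malicious-class probabilities is an assumption here, as are the function names, the 0.5 threshold and the example probabilities.

```python
from typing import Sequence

# Weights from the results slide, in the order (URL, JS, KW).
WEIGHTS = (2.0, 3.0, 2.0)


def weighted_average(probs: Sequence[float],
                     weights: Sequence[float] = WEIGHTS) -> float:
    """Weighted mean of the per-model probabilities of the malicious class."""
    if len(probs) != len(weights):
        raise ValueError("one probability per weight is required")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(p * w for p, w in zip(probs, weights)) / total


def classify(probs: Sequence[float],
             weights: Sequence[float] = WEIGHTS,
             threshold: float = 0.5) -> str:
    """Label a page from the weighted score (threshold is an assumption)."""
    score = weighted_average(probs, weights)
    return "malicious" if score >= threshold else "benign"


# Hypothetical outputs of the URL, JS and KW models for a single page.
page = (0.8, 0.2, 0.6)
print(round(weighted_average(page), 3), classify(page))
print(classify(page, weights=(1, 1, 1)))  # same page with equal weights
```

For this made-up page the heavier JS weight pulls the score below the threshold, whereas equal weights would flag it as malicious, which is the kind of shift the weighting is meant to produce.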

Editor's notes

  1. Hello Everyone! My name is Dharmendra Vishwakarma. This is a presentation of the Research Project for Master’s in Data Analytics course. The research topic is on “Detecting malicious web pages using an ensemble weighted average model”.
  2. The area of my study is a mix of the cyber security and data analytics domains. 1. With advances in communication technologies and ever-increasing internet use, most services are online nowadays, such as e-banking, social networking, e-commerce and entertainment. Because services and information are so easily available, users tend to browse the internet freely without knowing its negative side. These services are exploited by cyber attackers to steal useful, private and sensitive user information. 2. Cyber attackers use websites as a medium to redirect users to their malicious network for further attacks, or use drive-by-download software to install malware locally on the user's computer. This enables attackers to carry out other cyber-criminal activities such as ransomware, botnets, information stealing and DDoS. This leads to loss of information privacy and, in many cases, losses to businesses.
  3. Three main categories of solutions exist for this problem. Firstly, users are taught prevention techniques through education, and legislation through government initiatives discourages such activities. However, given how busy people are in real-world business settings, they still make mistakes. The second approach consists of computerised hand-crafted techniques to prevent phishing. These usually involve static techniques such as blacklisting and whitelisting. There is also a dynamic approach, in which a virtual sandbox environment is used to observe the behaviour of web pages in order to detect deceptive behaviour. This method is not ideal for real-time detection, but it can be employed to create a blacklist of URLs. Lastly, intelligent machine learning models address the problem using features present in the website. A recent study using a keyword density-based approach for detecting malicious websites has shown significant accuracy. However, given the varying nature of attacks, page content alone cannot be the only significant indicator of a website's deceptive nature.
  4. So, the research question for my proposal is “”
  5. And the specific objectives of this research are “”. In this research proposal, various other important factors are considered along with the content-based approach. These factors are URL-based features, DNS information, server details and the JavaScript code present on the page. They can all contribute to the final decision, as the URL alone cannot efficiently detect the phishing behaviour of a website. The main contribution of this research will be the use of ensemble learning to decide the final classification result from the individual models.
  6. The literature review suggests the following trends. Many authors have considered different features of malicious websites, such as URL, DNS, JavaScript and page content. All of this previous research considered different aspects of malicious threats to develop solutions. However, there is a need for a hybrid set of solutions that can detect malicious content even if one feature set fails to detect it. For instance, web threats can appear in many forms within a page, such as XSS, phishing or a DDoS attack. The idea is to give each feature set a weighted impact on the final decision.
  7. The research methodology is based on CRISP-DM, which is a successful methodology for data mining projects. The research tasks are therefore divided into the six phases of the CRISP-DM paradigm.
  8. The dataset for this research is as follows: 100 thousand benign URLs will be extracted from Alexa and 20 thousand malicious URLs will be downloaded from PhishTank. Both datasets have been used previously in the literature. Since a comparison will be made against the baseline model, the same dataset is used.
  13. Box plot for outlier detection: URL length shows outliers, which are explored further by class.
  14. The data is not normally distributed; most of it is right-skewed.
  15. Correlated attributes are detected; for example, cookies_ref_count is related to setinterval time. The rest seem fine and equally important for model building.
  16. The implementation is as follows. Web pages from the dataset are extracted and stored along with their URLs. The features related to keyword density, URL, JavaScript code and DNS server relationships are extracted in a feature extraction step. These features, together with the class variable, are supplied to individual machine learning models. Their outputs are given as input to the weighted ensemble model. In this way the dynamic weights are determined and a trained model is generated. The entire process is split into training and prediction. During prediction, unseen web pages are evaluated with the predictive model. The evaluation is conducted using precision, recall, F1-score, area under the ROC curve and 10-fold cross-validation. Furthermore, a statistical test is carried out to check the significance of the model.
  17. The ensemble technique has the lowest error compared with the individual models.
  18. These are the references used in the presentation.
  19. Thank You.
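
The significance check mentioned in the notes and on the comparison slide (McNemar's test, α = 0.05) can be sketched in a few lines. This is an illustration rather than the author's code: it assumes the common continuity-corrected chi-square form of the test with one degree of freedom, and the labels and predictions below are invented purely to show the mechanics.

```python
import math
from typing import Sequence, Tuple


def mcnemar(y_true: Sequence[int],
            pred_a: Sequence[int],
            pred_b: Sequence[int]) -> Tuple[int, int, float, float]:
    """Return (b, c, statistic, p_value) for two classifiers on the same data.

    b counts samples that model A gets right and model B gets wrong;
    c counts samples that model A gets wrong and model B gets right.
    """
    rows = list(zip(y_true, pred_a, pred_b))
    b = sum(1 for t, a, m in rows if a == t and m != t)
    c = sum(1 for t, a, m in rows if a != t and m == t)
    if b + c == 0:
        return b, c, 0.0, 1.0  # the models never disagree
    # Continuity-corrected statistic, chi-square with 1 degree of freedom.
    stat = (abs(b - c) - 1) ** 2 / (b + c)
    # Survival function of chi-square(1): P(X > stat) = erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return b, c, stat, p_value


# Hypothetical labels and predictions for 40 pages (1 = malicious).
y_true = [1] * 40
pred_a = [1] * 20 + [1] * 15 + [0] * 3 + [0] * 2  # e.g. the ensemble
pred_b = [1] * 20 + [0] * 15 + [1] * 3 + [0] * 2  # e.g. the baseline

b, c, stat, p = mcnemar(y_true, pred_a, pred_b)
print(b, c, round(stat, 3), p < 0.05)
```

Only the discordant pairs (b and c) enter the statistic, which is why the test suits comparing two models evaluated on the same set of pages.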