Local Image Descriptor Based Phishing Web Page
Recognition as an Open-Set Problem
Ahmet Selman Bozkır, Esra Eroglu and Murat Aydos
Hacettepe University Department of Computer Engineering, TURKEY
Baskent University, Department of Management Sciences, TURKEY
Topics
• What is phishing?
• Types of phishing
• Facts about phishing
• Existing approaches
• Why a vision based scheme?
• Proposed Method
• Details of Phish-IRIS dataset
• Experiments and results
• Conclusion
What is phishing?
• Phishing is a scam that deceives users with fake web pages
that visually mimic legitimate sites in order to steal
valuable digital data such as credit card information or
e-mail passwords.
• Etymology: phone phreaking + fishing -> «phishing»
Types of Phishing
• Spear phishing (person focused)
• Clone phishing (bulk)
• Whaling (person/institution focused)
• Rogue Wi-Fi (MitM)
Facts and figures
* Source: PhishLabs 2016 Phishing Trends & Intelligence Report
Facts and figures
• In 2017, 700,000 unique
phishing attacks were
reported*
* Source: 2017 Quarter 3 Phishing Reports of APWG
Facts and figures
• The average lifetime of a phishing page is 32 hours*
• The risk from zero-day attacks is growing because such
pages are not yet listed in blacklists
* Source: APWG, Phishing activity trends paper. [Online].
Available at http://www.antiphishing.org/resources/apwg-papers/
Facts and figures
90% of consumer-oriented phishing attacks targeted:
• financial institutions
• cloud storage/file hosting sites
• webmail and online services
• ecommerce sites
• payment services
* Source: PhishLabs 2016 Phishing Trends & Intelligence Report
Facts and figures
• financial institutions
• payment services
• cloud storage/file hosting sites
• cryptocurrency
* Source: PhishLabs 2016 Phishing Trends & Intelligence Report
Existing Anti-Phishing Approaches
Content & blacklist based:
• CANTINA [1]
• SpoofGuard [2]
• NetCraft [3]
DOM based:
• Medvet et al. [4]
• Zhang et al. [5]
• Fu et al. [6]
Vision based:
• Maurer et al. [7]
• Verilogo [8]
• DeltaPhish [10]
Other:
• Chen et al. [9]
Why a vision based scheme?
• Substitution of textual HTML elements with <IMG> or applet-like rich
internet application (RIA) content
• Zero-day attacks need proactive solutions
• Dynamic / AJAX type content loading
• Different DOM organization between the legitimate page and its
phishing copy
• Robustness against complex backgrounds or page layouts
• Most importantly, vision based solutions are in
concordance with human perception
Our Proposal:
Use of SIFT and DAISY descriptors in Bag of Visual Words Representation
Scale Invariant Feature Transform
• Lowe, 2004
• Local patch based, key point driven sampling
• Scale invariance
• Rotation invariance
• Robustness against illumination changes
Our Proposal:
Use of SIFT and DAISY descriptors in Bag of Visual Words Representation
DAISY Descriptor
• Tola et al., 2010
• Local patch based, dense sampling
• Scale invariance
• Robustness against illumination changes
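The Bag of Visual Words step named in the proposal can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the 2-D tuples stand in for real 128-dimensional SIFT or DAISY descriptors, the three-entry codebook stands in for a learned vocabulary of 50 to 400 visual words, and the function names `nearest_word` and `bovw_histogram` are hypothetical.

```python
import math
from collections import Counter


def nearest_word(descriptor, codebook):
    """Return the index of the codebook entry closest to the descriptor (Euclidean)."""
    return min(range(len(codebook)), key=lambda i: math.dist(descriptor, codebook[i]))


def bovw_histogram(descriptors, codebook):
    """Assign each local descriptor to its nearest visual word and
    return an L1-normalized histogram over the codebook."""
    if not descriptors:
        return [0.0] * len(codebook)
    counts = Counter(nearest_word(d, codebook) for d in descriptors)
    return [counts.get(i, 0) / len(descriptors) for i in range(len(codebook))]


# Toy stand-ins (assumptions): a 3-word codebook and 4 descriptors from one screenshot.
codebook = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
descriptors = [(0.1, 0.0), (0.9, 1.2), (1.1, 0.8), (4.8, 5.1)]

histogram = bovw_histogram(descriptors, codebook)
print(histogram)
```

The resulting fixed-length histogram is what a classifier such as SVM, XGBoost or Random Forest would consume, regardless of how many key points a screenshot produces.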
Architecture - Flow
Phish-IRIS Data Set
Publicly available at https://web.cs.hacettepe.edu.tr/~selman/phish-iris-dataset
Phish-IRIS Data Set
• Motivation: lack of a common phishing dataset tailored for vision based
anti-phishing
• Based on real-world observation and literature
• 14 heavily phished target brands + legitimate samples
• Collected between March and May 2018 from PhishTank and OpenPhish
• Open-set problem -> collect “other” legitimate samples
• Distinct screenshots
• 1313 training + 1539 testing samples
• Screenshots were collected via a specially implemented Java based
wrapper equipped with Selenium WebDriver
Phish-IRIS Data Set
Brand Name         Training Samples   Testing Samples
Adobe              43                 27
Alibaba            50                 26
Amazon             18                 11
Apple              49                 15
Bank of America    81                 35
Chase Bank         74                 37
DHL                67                 42
Dropbox            75                 40
Facebook           87                 57
LinkedIn           24                 14
Microsoft          65                 53
PayPal             121                93
Wells Fargo        89                 45
Yahoo              70                 44
Other              400                1000
Total              1313               1539
• Brand samples: roughly 2/3 are used for training (913 of 1452)
• “Unknown/other” samples: training to testing ratio of 4/10 (400 to 1000)
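The split figures can be checked directly from the table. The snippet below is only a worked arithmetic sketch; the dictionary layout and variable names are illustrative, and the counts are copied from the table.

```python
# (training, testing) sample counts per brand, copied from the table.
brands = {
    "Adobe": (43, 27), "Alibaba": (50, 26), "Amazon": (18, 11), "Apple": (49, 15),
    "Bank of America": (81, 35), "Chase Bank": (74, 37), "DHL": (67, 42),
    "Dropbox": (75, 40), "Facebook": (87, 57), "LinkedIn": (24, 14),
    "Microsoft": (65, 53), "PayPal": (121, 93), "Wells Fargo": (89, 45),
    "Yahoo": (70, 44),
}
other = (400, 1000)  # legitimate "other" pages that make the task open-set

brand_train = sum(train for train, _ in brands.values())
brand_test = sum(test for _, test in brands.values())
total_train = brand_train + other[0]
total_test = brand_test + other[1]

# Share of brand samples used for training, and the train/test ratio for "other".
brand_train_share = brand_train / (brand_train + brand_test)
other_ratio = other[0] / other[1]

print(total_train, total_test)
print(round(brand_train_share, 3), other_ratio)
```

Note that the "other" class is deliberately heavier in the test set than in the training set, which stresses the classifier with unseen legitimate pages.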
Experiments
• We ran various experiments with 2 types of descriptors using the
SVM, XGBoost and Random Forest algorithms
• We tested different visual word counts (50, 100, 200 and 400)
in order to understand whether sparsity or
weak/strong features affect the prediction quality
• Models were trained on the training images and
assessed on the test images
• Evaluations were carried out in terms of True Positive Rate,
False Positive Rate and F1 measures
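For a multi-class task, TPR, FPR and F1 are usually computed per class in a one-vs-rest fashion and then averaged. The sketch below assumes support-weighted averaging, which the slides do not state; under that assumption the weighted TPR equals overall accuracy, which would be consistent with the matching "Test acc" and "TPR" columns in the result tables. The labels and predictions are made-up toy data, and the function names are hypothetical.

```python
def per_class_rates(y_true, y_pred, labels):
    """Return {label: (tpr, fpr, f1, support)} from one-vs-rest counts."""
    rates = {}
    pairs = list(zip(y_true, y_pred))
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        tn = len(pairs) - tp - fn - fp
        tpr = tp / (tp + fn) if tp + fn else 0.0
        fpr = fp / (fp + tn) if fp + tn else 0.0
        precision = tp / (tp + fp) if tp + fp else 0.0
        f1 = 2 * precision * tpr / (precision + tpr) if precision + tpr else 0.0
        rates[label] = (tpr, fpr, f1, tp + fn)
    return rates


def weighted_average(rates, index):
    """Average one metric over classes, weighting each class by its support."""
    total = sum(r[3] for r in rates.values())
    return sum(r[index] * r[3] for r in rates.values()) / total


# Toy data (assumption): two brands plus the open-set "other" class.
y_true = ["paypal", "paypal", "paypal", "adobe", "adobe",
          "other", "other", "other", "other", "other"]
y_pred = ["paypal", "paypal", "other", "adobe", "other",
          "other", "other", "other", "paypal", "other"]
labels = ["paypal", "adobe", "other"]

rates = per_class_rates(y_true, y_pred, labels)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
weighted_tpr = weighted_average(rates, 0)
weighted_fpr = weighted_average(rates, 1)
weighted_f1 = weighted_average(rates, 2)
print(accuracy, weighted_tpr, weighted_fpr, weighted_f1)
```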
Results – 1: SIFT based prediction
Word count   Algorithm       Train acc   Test acc   TPR      FPR      F1
bov-50       svm             0.611       0.7732     0.7732   0.016    0.77
bov-50       xgboost         0.725       0.8187     0.8187   0.012    0.82
bov-50       random forest   0.729       0.842      0.842    0.112    0.84
bov-100      svm             0.674       0.803      0.803    0.014    0.80
bov-100      xgboost         0.762       0.846      0.8466   0.010    0.85
bov-100      random forest   0.749       0.860      0.860    0.0099   0.86
bov-200      svm             0.747       0.837      0.837    0.011    0.84
bov-200      xgboost         0.799       0.8589     0.8589   0.01     0.86
bov-200      random forest   0.774       0.8823     0.8823   0.0084   0.88
bov-400      svm             0.821       0.8758     0.875    0.008    0.88
bov-400      xgboost         0.827       0.8934     0.893    0.0076   0.89
bov-400      random forest   0.8         0.8875     0.8875   0.0080   0.89
Results – 2: DAISY based prediction
Word count   Algorithm       Train acc   Test acc   TPR      FPR      F1
bov-50       svm             0.648       0.7465     0.746    0.018    0.74
bov-50       xgboost         0.678       0.7849     0.784    0.015    0.78
bov-50       random forest   0.699       0.816      0.816    0.013    0.8
bov-100      svm             0.709       0.7758     0.775    0.016    0.77
bov-100      xgboost         0.709       0.7953     0.795    0.014    0.79
bov-100      random forest   0.715       0.8226     0.822    0.012    0.81
bov-200      svm             0.725       0.7901     0.79     0.014    0.79
bov-200      xgboost         0.722       0.8174     0.817    0.013    0.81
bov-200      random forest   0.719       0.831      0.831    0.012    0.82
bov-400      svm             0.725       0.818      0.818    0.818    0.81
bov-400      xgboost         0.725       0.8122     0.812    0.013    0.8
bov-400      random forest   0.716       0.8356     0.835    0.011    0.82
Comparison with Dalgic et al. (2018)*
Method            # of Features   Learner         TPR     FPR      F1
JCD               5040            random forest   0.891   0.131    0.886
FCTH              5760            random forest   0.895   0.114    0.891
CEDD              4320            random forest   0.895   0.137    0.89
CLD               360             random forest   0.878   0.146    0.873
SCD               3584            svm             0.906   0.085    0.905
Our best (SIFT)   400             xgboost         0.893   0.0076   0.89
* Dalgic, F.C., Bozkir A.S., Aydos, M., “Phish-IRIS: A New Approach for Vision Based Brand Prediction of
Phishing Web Pages via Compact Visual Descriptors”, ISMSIT, Kızılcahamam, Ankara, 2018
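The trade-off between the strongest earlier descriptor (SCD) and the best SIFT configuration can be made concrete with a little arithmetic. The figures are taken from the comparison table; the dictionary keys and variable names are illustrative only.

```python
# Rows copied from the comparison table.
scd = {"features": 3584, "tpr": 0.906, "fpr": 0.085, "f1": 0.905}
sift = {"features": 400, "tpr": 0.893, "fpr": 0.0076, "f1": 0.89}

# How many times smaller the SIFT feature vector and false positive rate are.
feature_reduction = scd["features"] / sift["features"]
fpr_reduction = scd["fpr"] / sift["fpr"]

# What SIFT gives up in exchange.
tpr_gap = scd["tpr"] - sift["tpr"]
f1_gap = scd["f1"] - sift["f1"]

print(f"features: {feature_reduction:.1f}x fewer, FPR: {fpr_reduction:.1f}x lower")
print(f"TPR gap: {tpr_gap:.3f}, F1 gap: {f1_gap:.3f}")
```

In other words, SIFT trades a gap of under two points in TPR and F1 for roughly an order of magnitude fewer features and false positives.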
Conclusion
• SIFT based prediction outperforms DAISY based recognition
• Computing SIFT descriptors took less time than computing DAISY descriptors
• A key finding is the importance of the sampling strategy: key point based
sampling yields better results than dense sampling
• The more visual words we extract, the higher the accuracy we achieve, so sparsity was not
found to be a problem for SIFT or DAISY
• Inference takes only 0.32 seconds on a PC equipped with an Intel 8750 and 16 GB of memory
• Scalable Color Descriptor (SCD) still surpasses SIFT in terms of TPR and F1, meaning that
color information is as important as edges, contours and textures. However, SIFT achieves a
much lower false positive rate with far fewer features
• Consequently, SIFT is a suitable candidate for phishing web page detection/recognition
• Future work: use of deep convolutional neural networks
References
1. Y. Zhang, J. Hong, L. Cranor, CANTINA: A Content-Based Approach to Detecting Phishing Web Sites, WWW 2007.
2. N. Chou, R. Ledesma, Y. Teraguchi, D. Boneh and J.C. Mitchell, Client-Side Defense against Web-Based Identity Theft, In Proceedings of the 11th Annual Network and Distributed System Security Symposium (NDSS '04).
3. Netcraft, Netcraft Anti-Phishing Toolbar. Visited: April 20, 2016. http://toolbar.netcraft.com/
4. E. Medvet, E. Kirda and C. Kruegel, Visual-Similarity-Based Phishing Detection, Securecomm ’08 International Conference on Security and Privacy in Communication Networks, 2008.
5. W. Zhang, H. Lu, B. Xu and H. Yang, Web Phishing Detection Based on Page Spatial Layout Similarity, Informatica, vol. 37, pp. 231-244, 2013.
6. A.Y. Fu, L. Wenyin and X. Deng, Detecting Phishing Web Pages with Visual Similarity Assessment Based on Earth Mover’s Distance (EMD), IEEE Transactions on Dependable and Secure Computing, pp. 301-311, 2006.
7. M.E. Maurer and D. Herzner, Using visual website similarity for phishing detection and reporting, In CHI’12 Extended Abstracts on Human Factors in Computing Systems, 2012.
8. G. Wang, H. Liu, S. Becerra, K. Wang, Verilogo: Proactive Phishing Detection via Logo Recognition, Technical Report CS2011-0669, UC San Diego, 2011.
9. T. Chen, S. Dick, J. Miller, Detecting Visually Similar Web Pages: Application to Phishing Detection, ACM Transactions on Internet Technology, 10(2), 2010.
