How I Learned to Stop Worrying and Love Building Data Products

Most people think a successful data product requires just three things: data, the
right algorithm, and good execution. But as anyone who’s tried to create one
knows, an effective product requires much more. In this talk, Dr. Correa Bahnsen
will share his successes—and failures—in building data products for information
security, and why an isolated data science team is a recipe for failure.

Published in: Data & Analytics

  1. Behind the Scenes in Building Data Products. From Data Science to Data Products: Experiences in Information Security. Alejandro Correa Bahnsen, PhD, Chief Data Scientist & Head of Research, acorrea@easysol.net
  2. Who am I? Chief Data Scientist at Easy Solutions. Industrial Engineer. PhD in Machine Learning from Luxembourg University. Scikit-Learn contributor. Organizer of Science Bogota Meetups.
  3. About Easy Solutions®: A leading global provider of electronic fraud prevention for financial institutions and enterprise customers. 430+ customers in 30 countries. 115 million users protected. 30 billion online connections monitored. Industry recognition.
  4. Aims of this talk: Discuss what makes a data science project successful.
  5. What is Data Science?
  6. What is Data Science?
  7. Is Data Science all hype and no substance?
  8. What is Data Science?
  9. (No text on this slide.)
  10. Data Science Is the Intersection of Hacking Skills, Math & Statistics Knowledge and Substantive Expertise. Those are the pillars of data science: computing, statistics, mathematics, and quantitative disciplines, combined to analyze data for better decision making.
  11. Hacking Skills: Ability to build things and find clever solutions to problems. • Programming/Coding: Python and R (and others) • Databases: MySQL, PostgreSQL, Cassandra, MongoDB and CouchDB • Visualization: D3, Tableau, Qlikview and Markdown • Big Data: Hadoop, MapReduce and Spark
  12. Hacking Skills
  13. Math & Statistics: Being able to understand the right solution to each problem. • Linear algebra: Matrix manipulation • Machine Learning: Random Forests, SVM, Boosting • Descriptive statistics: Describe, Cluster • Statistical inference: Generate new knowledge
  14. Substantive Expertise: The ability to ask good questions requires domain understanding; that is why a data scientist can’t create data-based solutions without good industry knowledge. • Is this A or B or C? (classification) • Is this weird? (anomaly detection) • How much/how many? (regression) • How is it organized? (clustering) • What should I do next? (reinforcement learning)
  15. Building Data Products in Information Security
  16. Research / Data Science Spectrum (axis labels: Innovation, Practicality): • Basic Research: Maybe someday, someone can use this • Applied Research: I might be able to use this • Working Prototype: I can use this (sometimes) • Quality Code: Software engineers can use this • Tool or Service: People can use this
  17. Total Fraud Protection: Discuss what makes a data science project successful
  18. Risk Based Authentication; Phishing URL Classification; Phishing Brand ID; Fraud Detection; HTML Injection; Biometrics. Figure labels: 19h risk = 10, 9h risk = 95.
  19. Phishing
  20. Typical Phishing Example
  21. Why Phishing Detection is Hard: Original Website; Only Using Images; Subtle Changes
  22. Ideal Phishing Detection System. Diagram labels: Machine Learning Algorithm; Is It Phishing?
  23. Ideal Phishing Detection System - Issues. Issues with full content analysis: • Time consuming • Impractical to process millions of websites per day • Hard to implement for small devices
  24. There is always the need for a URL
  25. Phishing URL Classification
  26. Database of URLs: 1,000,000 phishing URLs from PhishTank and 1,000,000 legitimate URLs from Common Crawl. Example URLs shown on the slide: http://moviesjingle.com/auto/163.com/index.php ; http://paypal.com.update.account.toughbook.cl/8a30e847925afc5975161aeabe8930f1/?cmd=_home&dispatch=d09b78f5812945a73610edf38 ; http://msystemtech.ru/components/com_users/Italy/zz/Login.php?run=_login-submit&session=68bbd43c854147324d77872062349924 ; https://www.sanfordhealth.org/ChildrensHealth/Article/73980 ; http://www.grahamleader.com/ci_25029538/these-are-5-worst-super-bowl-halftime-shows&defid=1634182 ; http://www.carolinaguesthouse.co.uk/onlinebooking/?industrytype=1&startdate=2013-09-05&nights=2&location&productid=25d47a24-6b74
  27. URL Lexical and Statistical Frequencies. Example URL: http://www.secure.paypal.com.papaya.com/secure_login.php. Features: URL length, Alexa Ranking, Path length, URL Entropy, # of .com, Punctuation count, TLD count, Is IP?, Euclidean distance, KS & KL distance. Output: Is It Phishing?
  28. URL Lexical and Statistical Frequencies, Results (3-fold CV): Accuracy average 93.47% (deviation 0.01%); Recall average 93.28% (deviation 0.02%); Precision average 93.64% (deviation 0.03%)
  29. URL Lexical and Statistical Frequencies: Feature Importance
  30. Modeling Phishing URLs with Recurrent Neural Networks
  31. Recurrent Neural Networks (RNN): Have loops!
  32. The Problem of Long-Term Dependencies: Short term dependencies are easy, long term …
  33. Long-Short Term Memory Networks (LSTM): RNN contains a single layer; LSTM contains four interacting layers. Source: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
  34. Long-Short Term Memory Networks. [Figure: the URL http://www.papaya.com is split into characters; each character passes through One hot Encoding, then Embedding, then a chain of LSTM cells, followed by a Sigmoid output.]
  35. Models Comparison. Random Forest: memory consumption 289 MB, evaluation time 942 URLs per sec, training time 2.95 minutes, accuracy 93.7%. Long-Short Term Memory Network: memory consumption 0.56 MB, evaluation time 281 URLs per sec, training time 238.7 minutes, accuracy 98.7%.
  36. (No text on this slide.)
  37. 1. Let’s build Swordphish 2. Data Collection 3. Random Forest Classifier 4. API 5. Product Evaluation 6. Recurrent Neural Networks 7. Distributed API 8. Port to C++ 9. Sales & Marketing. Total Effort: 30 %, 50 %, 20 %
  38. (No text on this slide.)
  39. Identifying Targeted Brand
  40. BrandID - Scope: Given a phishing attack, determine the targeted bank
  41. BrandID - Scope: • Create a learning ML engine to label attacks against any brand • Not limited to current customers or known layouts • Apply ML techniques to extract knowledge • Enhance predictive capabilities
  42. BrandID - Issues: 1000’s of phishing attacks per hour detected worldwide
  43. BrandID - Issues: Target is not always straightforward
  44. BrandID - Issues: Target is not always straightforward
  45. BrandID Architecture: 1. Get Phishing Site Info 2. Analyze Images 3. Analyze Text 4. HTML Structure 5. Machine Learning Classifier
  46. Step 1, Get Phishing Site Info: 1. Screenshot 2. Page Text 3. HTML Code
  47. Step 1, Get Phishing Site Info
  48. Step 1, Get Phishing Site Info: Splash takes 5 s to render one URL. BrandID receives 33,000 URLs per day. It would take 4.6 days to process one day of URLs. It’s expected to grow up to 1,000,000.
  49. Step 2, Analyze Images
  50. Step 2, Analyze Images: Transfer Learning and Siamese Networks. Main idea: find a function that maps input patterns into a target space such that a simple distance in the target space (say the Euclidean distance) approximates the “semantic” distance in the input space. Figure label: .84
  51. Step 3, Analyze Text: OCR with Leptonica
  52. Step 4, HTML Structure and WhoIS
  53. Step 5, Machine Learning Classifier (Training). Training Label: False, False, True, False
  54. Step 5, Machine Learning Classifier (Predicting). Predicted Prob: 0.96, 0.12, 0.05, 0.09
  55. Results: 0.59, 0.99
  56. Results 2: 0.94, 0.80
  57. BrandID Architecture: 1. Get Phishing Site Info 2. Analyze Images 3. Analyze Text 4. HTML Structure 5. Machine Learning Classifier
  58. Business Case, Random Forest Classifier, Data Collection, Product Evaluation, Image Analysis, Distributed API, Splash JS, NLP, Spark, AKKA, Transfer Learning, HTML Analysis. Total Effort: 10 %, 50 %, 40 %
  59. At the end of the day, there is much more to Data Products than just Machine Learning.
  60. Any questions or comments, please let me know. Alejandro Correa Bahnsen, PhD, Chief Data Scientist & Head of Research, acorrea@easysol.net. Thank you!
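
Slide 27 lists lexical features computed from a URL string. A minimal Python sketch of a few of them follows. The function names, the exact feature definitions (for example, entropy as Shannon entropy over characters) and the IPv4-only check are assumptions made for illustration; the slides do not specify them. Features that need external data (Alexa ranking, TLD lists, the distance measures) are left out.

```python
import math
import re
import string
from collections import Counter
from urllib.parse import urlparse

# Assumption: "Is IP?" means the host is a dotted IPv4 address.
IPV4_RE = re.compile(r"^\d{1,3}(\.\d{1,3}){3}$")


def shannon_entropy(text):
    """Shannon entropy (bits per character) of the character distribution."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def url_lexical_features(url):
    """Return a dict with a subset of the features named on slide 27."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    return {
        "url_length": len(url),
        "path_length": len(parsed.path),
        "url_entropy": shannon_entropy(url),
        "dot_com_count": url.count(".com"),
        "punctuation_count": sum(1 for ch in url if ch in string.punctuation),
        "is_ip": bool(IPV4_RE.match(host)),
    }


features = url_lexical_features(
    "http://www.secure.paypal.com.papaya.com/secure_login.php"
)
```

A feature dictionary like this could be turned into a row of a feature matrix and passed to a classifier such as a random forest, which is the model family named in the slides.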
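
Slide 34 shows a URL being split into characters and one-hot encoded before the embedding and LSTM layers. The sketch below covers only that first step in plain Python. The vocabulary, the fixed length of 150, the left padding and the use of index 0 for padding and unknown characters are all assumptions; the slides do not state them. The embedding, LSTM and sigmoid layers would normally come from a deep learning framework and are not shown.

```python
import string

# Assumed vocabulary: lowercase letters, digits and ASCII punctuation.
VOCAB = string.ascii_lowercase + string.digits + string.punctuation
# Index 0 is reserved for padding and for characters outside the vocabulary.
CHAR_TO_INDEX = {ch: i + 1 for i, ch in enumerate(VOCAB)}


def encode_url(url, max_len=150):
    """Map a URL to a fixed-length list of character indices (left-padded)."""
    indices = [CHAR_TO_INDEX.get(ch, 0) for ch in url.lower()[:max_len]]
    return [0] * (max_len - len(indices)) + indices


def one_hot(indices, size=len(VOCAB) + 1):
    """Turn each index into a one-hot row of length `size`."""
    rows = []
    for idx in indices:
        row = [0] * size
        row[idx] = 1
        rows.append(row)
    return rows


encoded = encode_url("http://www.papaya.com", max_len=30)
matrix = one_hot(encoded)  # 30 rows, one per character position
```

In practice an embedding layer usually takes the integer indices directly, so the explicit one-hot matrix is mostly useful for understanding the picture on the slide.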
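
Slide 48 argues that rendering pages one at a time cannot keep up with the incoming volume. The back-of-the-envelope calculation can be sketched as follows, using hypothetical helper names. With only the two numbers stated on the slide (5 s per URL, 33,000 URLs per day), strictly sequential rendering comes to about 1.9 days per day of URLs; the slide's figure of 4.6 days is larger and would correspond to roughly 12 s per URL end to end. The transcript does not say what accounts for the difference. Either way, processing takes longer than the data takes to arrive, which is the point of the slide.

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60


def sequential_days(urls_per_day, seconds_per_url):
    """Days needed to process one day of URLs with a single worker."""
    return urls_per_day * seconds_per_url / SECONDS_PER_DAY


def workers_needed(urls_per_day, seconds_per_url):
    """Minimum number of parallel workers needed to keep up with the load."""
    return math.ceil(urls_per_day * seconds_per_url / SECONDS_PER_DAY)


days_now = sequential_days(33_000, 5)          # about 1.9
workers_now = workers_needed(33_000, 5)        # 2
workers_future = workers_needed(1_000_000, 5)  # 58
```

Under these assumptions, the expected growth to 1,000,000 URLs per day would need on the order of 58 renderers running in parallel, which is consistent with the distributed components listed on slide 58.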
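
Slide 50 describes the idea behind Siamese networks: learn a mapping so that Euclidean distance between mapped points reflects semantic similarity. The sketch below illustrates that idea with plain Python lists standing in for embeddings. The contrastive loss shown is one common choice for training such networks and is an assumption here, as are the function names, the margin value and the example vectors; the slides do not say which loss or lookup was used.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def contrastive_loss(a, b, same_brand, margin=1.0):
    """Pull matching pairs together, push non-matching pairs apart.

    Matching pairs are penalized by their squared distance; non-matching
    pairs are penalized only when they are closer than `margin`.
    """
    d = euclidean_distance(a, b)
    if same_brand:
        return 0.5 * d ** 2
    return 0.5 * max(0.0, margin - d) ** 2


def closest_brand(embedding, references):
    """Return the (brand, distance) pair nearest to `embedding`.

    `references` is a hypothetical dict of brand name -> reference embedding.
    """
    return min(
        ((brand, euclidean_distance(embedding, ref)) for brand, ref in references.items()),
        key=lambda pair: pair[1],
    )


# Made-up two-dimensional embeddings, for illustration only.
references = {"bank_a": [0.0, 0.0], "bank_b": [3.0, 4.0]}
brand, distance = closest_brand([0.5, 0.0], references)
```

Once a network has been trained this way, identifying the targeted brand of a screenshot reduces to a nearest-neighbor lookup against reference embeddings, as `closest_brand` illustrates.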
