Avast @ Machine Learning
Prague - HANDOUT
S in IoT stands for Security
Our team: IoT and ML Research department
2
Galina Adam Tomáš Marek
Martin Vláďa Martin
Avast in numbers
Around 2000 employees
435 million users
Users in over 150 countries
Protecting from 3.5 billion attacks per month
Blocked 128 million ransomware attacks in 2016
Our engines check 200 billion URLs and 300 million new files monthly
3
Present: Traditional Avast Business
4
• The number of IoT devices is on the rise: 75 billion connected things are
expected by 2025
• IP Cameras
• Network attached storages
• Thermostats
• Smart speakers
• Digital personal assistants
• … you name it, the IoT world will have it
Future: The IoT world
5
• IoT products are designed for
• convenience
• usability
• but not necessarily to be easily secured
• They can be compromised in many ways
• Spy on users
• Blackmail users
• Gain physical access to the home
• Misuse of devices for
• Attacking third-party services
• Misuse of computational power
Future: Securing the IoT world
6
• Avast is developing an AI-based protection for IoT
• The key to protecting IoT devices:
Cloud-based security that monitors for
threats at the network level
Avast Smart Life - a new product coming
7
09:00 - 09:30 Yin and Yang of IoT
09:30 - 10:00 A Case study: Mirai attack vector
10:00 - 10:30 Machine learning algorithms and feature engineering
10:30 - 11:00 Coffee Break
11:00 - 11:30 Neural networks for classification of binary files
11:30 - 12:00 Identifying devices within the network
12:00 - 12:30 Phishing prevention and blocking of malicious URLs
Workshop structure
8
IoT Botnets
Adam Hanka
9
Botnet
Set of enslaved devices (usually IoT) that can be
controlled by a cybercriminal
OR
Malware that enslaves the devices
10
But why?
Can be used to
• Gain computational power
• Perform distributed denial-of-service attack (DDoS attack)
• Send spam
• Mine Cryptocurrencies
• Steal data
DDoS Business model
• Harm and destroy your competitors and opponents
• Business competitors
• Political opponents, Independent journalism
• Sell DDoS as a service
• Blackmail companies (Money or DDoS!)
11
Botnet components
• Zombie computer
• A compromised node
infected by the botnet
malware
• Command and Control (CnC)
server
• Server that remotely controls
the zombie computers
• Botmaster
• A person who operates the
CnC server
• Hides their identity (via Tor,
proxies, …)
12
Architectures
Client-Server Peer to peer
13
A case study: Mirai
• Attacks vulnerable IoT devices with factory-default credentials
• IP cameras
• Network storages
• Client - Server architecture
• The two main components are the Mirai bot itself and a C&C server
• Both available on GitHub
• Spreads like a worm over the internet: each infected node scans the whole IPv4 address space
• Does not attack the US Department of Defense (DoD)
• Not persistent: a device restart removes Mirai (but it soon comes back, because the
C&C server remembers the device)
14
A case study: Mirai
• October 2016: one of the most impactful cyberattacks ever hit the online
infrastructure firm Dyn, affecting Twitter, Spotify, Reddit, Airbnb, Netflix, ...
• Also hit and disabled krebsonsecurity.com just hours after Krebs presented a talk on
Mirai at a conference
• Solution: Google’s Project Shield protection.
• Attacks with a power of 600 - 1500 Gb/s
• 150 000 enslaved devices make for 1 Tb/s of DDoS capability
15
Mirai: Vulnerable device setup
• Telnet running (big mistake, but still very popular)
• Forwards its port 23 to the router (i.e. port 23 is visible from the outside of the network)
16
Mirai: Dictionary Attack
• Knock knock, will you let me in?
• Mirai has a predefined dictionary of factory-setting credentials, tries them randomly
• Sends the commands via telnet (plaintext)
18
Mirai: Password guessed
It notifies the C&C server, which sends a telnet command to download the Mirai binary (via
wget or tftp)
19
Mirai: Mass Scan
• Scans the internet and tries to infect vulnerable devices in other networks
• Mirai is quite a stupid parasite: sometimes it even kills its host
20
Mirai: DDoS
When the command comes, this is the result:
21
A map of internet outages in Europe and North America caused by the Dyn
cyberattack (as of 21 October 2016 1:45pm Pacific Time).
Mirai: Visualization of Activity
Telnet & Mass Scan
22
A case study: Mirai attack vector
Summary of Mirai attack:
• Uses vulnerable telnet
• Has a list of factory settings (CHANGE YOUR PASSWORD)
• Scans the whole range of the internet
• Is actually very simple; nonetheless, it caused a lot of trouble in 2016
IoT malware is very simple compared to PC malware, but it will evolve
and gain complexity
23
ML for security
in the Networking context
Martin Bálek
Security of an endpoint
• Block malicious servers / sites
• Check traffic
• Detect malicious content
25
26
Networking
Our data domain
• Stream of packets
• Big amount of data
• A lot of protocols (tcp, udp, ip, icmp, upnp, telnet, ftp, http, https…)
• Deep vs. stateful packet inspection
27
28
Our data domain
• Flows
• Source IP address
• Source Port
• Destination IP address
• Destination Port
• Protocol
• + some useful stats
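The flow representation above can be sketched roughly as follows: packets are grouped by their 5-tuple and a few statistics are accumulated per flow. The `Packet` fields and the chosen statistics (packet count, byte count) are assumptions made for this illustration, not the format used in the talk.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, NamedTuple


class FlowKey(NamedTuple):
    """The 5-tuple that identifies a flow."""
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    protocol: str


@dataclass
class Packet:
    """Hypothetical parsed packet; only the fields needed here."""
    key: FlowKey
    size: int  # payload size in bytes


@dataclass
class FlowStats:
    """Example per-flow statistics (an assumed, minimal choice)."""
    packets: int = 0
    total_bytes: int = 0


def aggregate_flows(packets: Iterable[Packet]) -> Dict[FlowKey, FlowStats]:
    """Group packets into flows and accumulate simple statistics."""
    flows: Dict[FlowKey, FlowStats] = defaultdict(FlowStats)
    for pkt in packets:
        stats = flows[pkt.key]
        stats.packets += 1
        stats.total_bytes += pkt.size
    return dict(flows)
```

Keying on the full 5-tuple keeps the two directions of a connection as separate flows; whether to merge them is a design choice this sketch leaves open.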
29
Visualizations
30
31
• Detect devices which are present in network
• Find communication patterns
• Detect malicious behavior
• Make all of the above robust
Challenges
Data from ML point of view
• Features
• Are they available at any time? (e.g. total amount of data)
• How cheap are they to calculate? (e.g. packet interarrival time - mean, variance based)
• Is the information relevant? (e.g. port)
• Time series... but begin and end are possibly more important
• Single packet/flow is never important (or is it?)
• Traffic is not deterministic (e.g. connection issues, port or IP changes a lot)
• All mixed together
• A lot of legitimate scenarios (e.g. torrent vs. mass scan of the network)
• Are we always sure what we are modeling?
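One way to keep a feature such as the mean and variance of packet interarrival times cheap is to update it incrementally instead of storing all timestamps. The sketch below uses Welford's online method; the class and method names are hypothetical and the talk does not say how these features were actually computed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class InterarrivalStats:
    """Running mean/variance of packet interarrival times (Welford's method)."""
    count: int = 0          # number of gaps seen so far
    mean: float = 0.0       # running mean of the gaps
    _m2: float = 0.0        # running sum of squared deviations
    _last_ts: Optional[float] = None

    def observe(self, timestamp: float) -> None:
        """Feed the arrival time (in seconds) of the next packet."""
        if self._last_ts is not None:
            gap = timestamp - self._last_ts
            self.count += 1
            delta = gap - self.mean
            self.mean += delta / self.count
            self._m2 += delta * (gap - self.mean)
        self._last_ts = timestamp

    @property
    def variance(self) -> float:
        """Sample variance of the gaps; 0.0 until two gaps were seen."""
        return self._m2 / (self.count - 1) if self.count > 1 else 0.0
```

Each update costs constant time and memory, which matters when the statistic has to be maintained for every flow in a packet stream.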
32
ZOO of algorithms
• Unsupervised techniques
• Anomaly detection
• Communication patterns detection
• Semisupervised methods
• Device identification
• Classification
• Content checking
• Patterns classification
33
34
Device identification
Galina Alperovich
Task description
● Device identification: automatically identify device type and device model based on
available device networking information
35
36
Why this task is important
● “Security in IoT” requires the ability to distinguish IoT devices from other devices =>
device identification
● Device type and device model are important features in malware detection
37
Existing approach
● Expert-created rules based on device features and regexps
● Advantages:
○ Utilize expert knowledge and many different features
○ Accurate upon exact match
● Disadvantages:
○ Misclassifications for too broad rules
○ Conflicting rules
○ Unknown accuracy
38
How we want to improve it
● More accurate model
● Can solve conflicts between “rules” automatically
● Ability to generalise
● Ability to tune every source of properties and measure the accuracy
● Level of confidence (probability) together with an answer
39
Device identification as classification
● Classification task where classes = different device types (~20 classes like phone,
security camera, printer, computer, bulb, fridge)
● More detailed task: classes = different models of the device class (thousands of
classes)
● Features:
○ Scan features (MAC-address, open ports, text body of specific responses)
○ Behavioral features (patterns in traffic consumption)
40
Example of data
Columns: Ports | MAC | Vendor | Protocol | Response | Label

Device 1
Ports: [67, 1900, 2869]
MAC: 4c:0b:be:43:19:28
Vendor: Microsoft
Protocol: DHCP
Response: hostname: XboxOne\nclassid: MSFT 5.0\nparamlist: 1,3,6,15,31,33,43,44,46,47,121,249,252
Label: Game console

Device 2
Ports: [80, 443, 1900]
MAC: e0:88:5d:8f:f9:11
Vendor: Technico
Protocol: UPNP
Response: '<?xml version="1.0"?>\n<root xmlns="urn:schemas-upnp-org:device-1-0">\n<specVersion>\n<major>1</major>\n<minor>0</minor>\n</specVersion>\n<device>\n<deviceType>urn:schemas-upnp-org:device:InternetGatewayDevice:1</deviceType>\n<friendlyName>ZyXEL Keenetic II</friendlyName>\n<manufacturer>ZyXEL Communications Corp.</manufacturer>\n<manufacturerURL>ht……….
Label: Router

Device 3
Ports: [88, 443, 554, 1900, 5353]
MAC: 00:0e:53:15:1d:ab
Vendor: AvTech
Protocol: HTTPS
Response: '<html>\n<head>\n<link rel="icon" href="/nobody/favicon.ico" type="image/vnd.microsoft.icon" />\n<link rel="shortcut icon" href="/nobody/favicon.ico" type="image/vnd.microsoft.icon" />\n<link rel="bookmark" href="/nobody/favicon.ico" type="image/vnd.microsoft.icon" />\n<meta http-equiv="Content-Type" content="text/html; charset=utf-8">\n<meta name="googlebot" content="nosnippet">\n<meta name="robots" content="noarchive">\n<title>::: Login :::</title>\n<style>\n<!--\nbody {background-image: url(/nobody/jpg/bg.jpg); margin-left: 0px;margin-top: 0px;margin-right: 0px;margin-bottom: 0px;}\ntd { font-size:14px;color:#FFFFFF;font-weight:bold; font-famil
Label: Security camera
41
Challenges
● Unbalanced dataset (could be changed in future)
● It’s hard to obtain ground truth labels (expert knowledge)
● Different categories of features (numerical, categorical, text)
● Missing values - some devices don’t have specific features at all, only a subset
○ Empty ports
○ Empty response strings
○ Randomized MAC-address
42
Ensemble classifier
43
44
(Diagram) One fixed dataset with all features (ports, MAC, DHCP response, etc.)
→ Preprocessing for Classifier #1, #2, ..., #N
→ Classifier #1, #2, ..., #N, each producing p_1, ..., p_n
→ Ensemble classifier
→ Label
p_1, ..., p_n - probabilities for device_class_1, ..., device_class_n
Advantages of ensembling
● More accurate than individual classifiers
● Each individual classifier is responsible for specific features
● You can tune an individual classifier and see the change in accuracy
● Explainable
● Able to backtrack
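The talk does not say how the ensemble combines the component outputs. As one simple possibility, the sketch below takes a weighted average of the per-class probability vectors and skips components that could not produce a prediction (for example because their features are missing). Function names, weights and the averaging rule itself are assumptions for illustration.

```python
from typing import List, Optional, Sequence, Tuple


def combine_probabilities(
    component_outputs: Sequence[Optional[Sequence[float]]],
    weights: Optional[Sequence[float]] = None,
) -> List[float]:
    """Weighted average of per-component class probabilities.

    A component that could not decide passes None and is skipped.
    """
    if weights is None:
        weights = [1.0] * len(component_outputs)
    used = [(p, w) for p, w in zip(component_outputs, weights) if p is not None]
    if not used:
        raise ValueError("no component produced a prediction")
    n_classes = len(used[0][0])
    total_weight = sum(w for _, w in used)
    return [
        sum(w * p[c] for p, w in used) / total_weight
        for c in range(n_classes)
    ]


def predict_label(
    classes: Sequence[str],
    component_outputs: Sequence[Optional[Sequence[float]]],
    weights: Optional[Sequence[float]] = None,
) -> Tuple[str, float]:
    """Return the most probable class together with its probability."""
    probs = combine_probabilities(component_outputs, weights)
    best = max(range(len(classes)), key=probs.__getitem__)
    return classes[best], probs[best]
```

Because every component contributes an explicit probability vector, it stays possible to trace a final label back to the component that drove it, which matches the "explainable" and "able to backtrack" points above.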
45
Silver classifier on MAC and open ports without labels
● If you have an unlabelled dataset, how do you quickly get a labelled one from it?
(Diagram: X, ? → X, y)
47
Semi-supervised iterative learning
1. Cluster one-hot encoded ports and vendor
features
2. Then expert labels some of the clusters
3. Collect labelled devices and call them silver
labels
4. Train classifier on silver labels
5. Run classifier on the whole dataset
6. Repeat from step 1 for the unclassified
examples (unclassified = low probability)
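Steps 5 and 6 of the loop above can be sketched as a split by prediction confidence: devices whose best class probability reaches a threshold receive a label, the rest go back to clustering. The function name, the `predict_proba` callable and the 0.9 threshold are hypothetical; the slides only say that unclassified means low probability.

```python
from typing import Callable, Dict, Hashable, List, Sequence, Tuple

Features = Sequence[int]


def split_by_confidence(
    devices: Dict[Hashable, Features],
    predict_proba: Callable[[Features], Dict[str, float]],
    threshold: float = 0.9,  # assumed cut-off, not taken from the talk
) -> Tuple[Dict[Hashable, str], List[Hashable]]:
    """Run the classifier on all devices and separate confident predictions.

    Returns the confidently labelled devices and the ids that remain
    unclassified and are fed into the next clustering round.
    """
    labelled: Dict[Hashable, str] = {}
    unclassified: List[Hashable] = []
    for device_id, features in devices.items():
        probs = predict_proba(features)
        label, p = max(probs.items(), key=lambda kv: kv[1])
        if p >= threshold:
            labelled[device_id] = label
        else:
            unclassified.append(device_id)
    return labelled, unclassified
```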
48
(Diagram) Cluster → manually label some clusters → train classifier on labelled data → run classifier on all data → take unclassified devices → cluster again
Classification coverage after 3 iterations
49
0% → 30% (manual labelling, 2 hours in total) → 77% (classifier)
Clustering
● ~1500 one-hot-encoded features
● Dataset biased towards rare classes (undersampling)
● Jaccard distance as the similarity measure
50
Example feature vector (ports | vendors): 0 0 1 0 1 0 0 1 ... | ... 0 0 0 1 0 0 0
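For one-hot encoded ports and vendors, the Jaccard distance compares which features two devices share against which features either of them has. A minimal sketch follows; the helper names and the tiny port vocabulary are made up for the example.

```python
from typing import Hashable, Iterable, List, Sequence


def one_hot(values: Iterable[Hashable], vocabulary: Sequence[Hashable]) -> List[int]:
    """Encode a set of values (e.g. open ports) as a 0/1 vector."""
    present = set(values)
    return [1 if v in present else 0 for v in vocabulary]


def jaccard_distance(a: Sequence[int], b: Sequence[int]) -> float:
    """Jaccard distance between two equal-length 0/1 vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    # Two all-zero vectors carry no evidence; treat them as identical here.
    return 0.0 if either == 0 else 1.0 - both / either
```

Unlike Euclidean distance, this measure ignores the many positions where both vectors are zero, which suits sparse vectors with ~1500 mostly empty features.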
Results of Silver classifier
51
● Accuracy on test set: 99%
● Able to classify: 77%
● Agreement with rule-based labels: 55%
● Contains initial silver labels: 30%
Components based on protocol responses
● DHCP - info about operating system
● zeroconf and UPnP - info about available services
● HTTP - info about user agent, server, location, admin interface
● General scheme:
○ Prepare xml/html/text documents
○ Text extraction based on heuristics specific for every protocol
○ Estimate the probability of a class given a word, e.g. P(security camera | “cgi”) = 0.81
○ Cluster words that often occur in the same documents
○ For a new device find the closest cluster
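The probability of a class given a word can be estimated by plain counting over labelled documents, as sketched below. The function name and the toy documents are hypothetical, and counting each word once per document is an assumption; the talk does not spell out the estimator.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple


def class_given_word(
    documents: Iterable[Tuple[List[str], str]],
) -> Dict[str, Dict[str, float]]:
    """Estimate P(class | word) from (extracted words, device class) pairs.

    A word is counted at most once per document.
    """
    counts: Dict[str, Counter] = defaultdict(Counter)
    for words, label in documents:
        for word in set(words):
            counts[word][label] += 1
    return {
        word: {
            label: n / sum(per_class.values())
            for label, n in per_class.items()
        }
        for word, per_class in counts.items()
    }
```

With few documents these estimates are noisy, so a real system would likely need smoothing or a minimum word frequency.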
53
54
Device identification using
Internet traffic
Martin Neznal
Time-series based learning
● Number of events in time
● Any continuous or discrete value
○ E.g. number of flows or number of unique
destinations
● Any time interval
● Challenges
○ Robust to different behaviors
○ Robust to legitimate anomalies
○ Robust to devices being turned on/off
Dynamic Time Warping
● Algorithm for measuring similarity of two time-series
● Calculates distance between the series
● Output ∈ [0, +∞[
○ More similar when output closer to zero
● O(N²) time complexity
● Time-series can have different lengths
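A minimal dynamic-programming sketch of DTW is shown below. It uses the absolute difference as the local cost, which is an assumption; the talk does not state which cost or normalisation was used.

```python
from typing import Sequence


def dtw_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Dynamic Time Warping distance with |x - y| as the local cost."""
    if len(a) == 0 or len(b) == 0:
        raise ValueError("both series must be non-empty")
    inf = float("inf")
    # cost[i][j] = cheapest alignment of the first i items of a
    # with the first j items of b
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # advance only in a
                cost[i][j - 1],      # advance only in b
                cost[i - 1][j - 1],  # advance in both series
            )
    return cost[-1][-1]
```

The two nested loops give the quadratic running time, and because the table is len(a) × len(b) the series may have different lengths.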
Dynamic Time Warping example
61
(Figure: two pairs of time series; DTW = 0.2 for the first pair and DTW = 0.6 for the second)
One component: Time based classifier
● Time-series for each device
● Similarities of behaviors
● Classification of devices
62
Time for classifier’s decision
● Each component needs some time to produce a
decision about a device
● The time is different for each classifier
● Precision vs. time
● Silver classifier
○ seconds - minutes
● Protocol responses classifiers
○ minutes - hours
● Time-series classifier
○ minutes - hours
63
Silver classifier
HTTP response
DHCP response
Time-series analysis
64
Perceptual phishing
detection
Libor Mořkovský & Tomáš Trnka
65
There is only one legitimate site!
There is only one amazon!
Target: https://www.amazon.com/ap/signin
Attacker: http://secure-login-verify-myaccount-information.ga/signin/
66
Schema of processing
67
URL filtering → Image acquisition → Key point detection → Matching → Evaluation & verification
Catching the phishers
- Challenge
- decide as early as possible
- we are able to crawl and screenshot only a fraction of the URLs
- Solution
- discard most of the URLs with reasonable confidence
- whitelist + whitelist exceptions
- string matches on fishy keywords (approximate)
- prefer likely distribution channels (email, facebook)
68
Catching the phishers
- Whitelist
- We use the 1 million most popular top private domains (TPDs) over several
months
- Whitelist exceptions
- but some popular sites are evil
- additionally, we collected popular TPDs that can easily host user
content, and we ‘subtract’ them from the list
- e.g. google.com but not sites.google.com
69
Catching the phishers
Approximate matching
- tokens from target sites
- adobe, alibaba, aliexpress, coinsbank, credit-suisse
Exact match (strings shorter than 5 characters)
- often misused constructs (paypal.com-blabla.gd)
- .com-, .org., cgi-bin
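The filtering described on these slides can be sketched as a whitelist check followed by string matching. Everything concrete below is an illustrative stand-in: the lists are tiny samples, `naive_tpd` replaces a proper Public Suffix List lookup, and the use of `difflib` with a 0.8 similarity threshold is an assumed way to do approximate matching, not the method from the talk.

```python
import re
from difflib import SequenceMatcher
from urllib.parse import urlsplit

# Tiny illustrative stand-ins, not production data.
WHITELIST = {"amazon.com", "google.com"}
WHITELIST_EXCEPTIONS = {"sites.google.com"}  # popular, but hosts user content
TARGET_TOKENS = ["adobe", "alibaba", "aliexpress", "coinsbank"]
FISHY_CONSTRUCTS = [".com-", ".org.", "cgi-bin"]


def naive_tpd(host: str) -> str:
    """Last two labels of the host; a real system needs the Public Suffix List."""
    return ".".join(host.split(".")[-2:])


def is_whitelisted(host: str) -> bool:
    """Whitelisted unless the host falls under a whitelist exception."""
    if any(host == e or host.endswith("." + e) for e in WHITELIST_EXCEPTIONS):
        return False
    return naive_tpd(host) in WHITELIST


def looks_fishy(url: str, min_ratio: float = 0.8) -> bool:
    """Exact match on misused constructs, approximate match on target tokens."""
    parts = urlsplit(url)
    rest = ((parts.hostname or "") + parts.path).lower()
    if any(construct in rest for construct in FISHY_CONSTRUCTS):
        return True
    tokens = [t for t in re.split(r"[^a-z0-9]+", rest) if t]
    return any(
        SequenceMatcher(None, token, target).ratio() >= min_ratio
        for token in tokens
        for target in TARGET_TOKENS
    )


def should_crawl(url: str) -> bool:
    """Keep a URL for crawling only if it is not whitelisted and looks fishy."""
    host = urlsplit(url).hostname or ""
    return not is_whitelisted(host) and looks_fishy(url)
```

The cheap checks run first so that the expensive crawl and screenshot step is spent only on the small fraction of URLs that survive.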
70
How to match two images?
Unpopular, but looks similar? Fishy!
72
Detecting the points
• Interesting points: points, edges, blobs
• There are many methods for detecting
interesting points in a picture, e.g. corner
detectors and edge detectors
• Examples of the points
74
Describing the points
• Abstract (mathematical)
representation of the detected
points
• Extract the patches
• FAST, SIFT, ORB, ...
75
Matching and verification
- Find the matrix transforming the points from one picture to another
- A transformation hypothesis is determined by 4 point pairs; we sample them
and estimate the transformation with RANdom SAmple Consensus (RANSAC)
77
Credits: Scikit-learn
Matching and verification
- Evaluate how reasonable the transformation is
- valid perspective transformation
- thresholds on scale, rotation, shear, translation
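A plausibility check of this kind can be sketched by decomposing the estimated transformation into scale, rotation, shear and translation and comparing each against a limit. The sketch below handles only the affine part (a 2x3 matrix) rather than a full perspective transformation, and all threshold values are made-up placeholders.

```python
import math
from typing import Sequence, Tuple


def decompose_affine(
    m: Sequence[Sequence[float]],
) -> Tuple[float, float, float, float, float, float]:
    """Split a 2x3 affine matrix [[a, b, tx], [c, d, ty]] into its parts.

    Returns (scale_x, scale_y, rotation_rad, shear, tx, ty), where the
    linear part equals R(rotation) @ [[scale_x, scale_x * shear], [0, scale_y]].
    """
    (a, b, tx), (c, d, ty) = m
    scale_x = math.hypot(a, c)
    if scale_x == 0.0:
        raise ValueError("degenerate transformation")
    rotation = math.atan2(c, a)
    shear = (a * b + c * d) / (scale_x * scale_x)
    scale_y = (a * d - b * c) / scale_x  # negative when the image is mirrored
    return scale_x, scale_y, rotation, shear, tx, ty


def is_plausible(
    m: Sequence[Sequence[float]],
    max_scale: float = 2.0,          # placeholder thresholds, not from the talk
    max_rotation_deg: float = 10.0,
    max_shear: float = 0.1,
    max_shift: float = 200.0,
) -> bool:
    """Accept only mild scaling, rotation, shear and translation."""
    sx, sy, rot, shear, tx, ty = decompose_affine(m)
    return (
        1.0 / max_scale <= sx <= max_scale
        and 1.0 / max_scale <= sy <= max_scale  # also rejects mirroring
        and abs(math.degrees(rot)) <= max_rotation_deg
        and abs(shear) <= max_shear
        and math.hypot(tx, ty) <= max_shift
    )
```

A phishing page is usually a near-copy of the target, so a match that requires a strong rotation, mirroring or heavy shear is more likely a coincidence among key points than a real visual match.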
78
Problems
- So far the only problem is (small) text blocks
- Text generates a lot of keypoints (high local contrast)
- A text block matches any other text block (the descriptors are all similar)
- Possible solutions
- ignore images with many key points
- detect text through high density in a single SIFT octave
- use an external text detector to mask areas
80
Conclusion & Future work
- Persuade the phishers that we are accessing their sites from different
locations
- More precise but still fast text detection mechanism
- Make it even faster ;)
81
Deep Convolutional Malware Classifiers Can Learn
from Raw Executables and Labels Only
Marek Krčál, Avast fellow at Institute of Computer Science
Joint work with Ondřej Švec, Martin Bálek, and Otakar Jašek
• Can the success of convnets (incl. end-to-end learning)
be transferred to malware detection?
• Applied to Windows executables, but rather domain-agnostic:
it could extend to other file formats and to
blocking their content in network traffic
Convnets achieve state-of-the-art in many areas
(Figure: image classification examples labelled Husky, Lilly, Eskymo, Spiral, Coffee, ...)
• without feature engineering!
• explainability (why boxer?)
Detection of Malicious Portable Executables
• Input is 1D – sequence of bytes (static malware analysis)
– Consists of header, sections, relocation tables, strings
– No global semantics of byte symbols
– Each can appear at an almost arbitrary place (high
translational variance)
• We use no domain expertise aside from labels
• Only two classes, clean and malware (for simplicity)
Dataset – 20 million executables
• Fewer files with compressed/encrypted machine code
• Between 12 kB and 1/2 MB
• Temporal split: training from 1.1.2016 to 1.1.2017, then validation up to week 8 and test up to week 16
• Roughly balanced clean and malware classes
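A temporal split like the one above can be sketched by assigning each sample to a set according to a date. The reading used here is an assumption: training covers 2016, and "week 8" and "week 16" are counted from 1.1.2017. The `first_seen` argument is a hypothetical per-file timestamp.

```python
from datetime import date, timedelta
from typing import Optional

TRAIN_START = date(2016, 1, 1)
VALIDATION_START = date(2017, 1, 1)
TEST_START = VALIDATION_START + timedelta(weeks=8)   # assumed week counting
TEST_END = VALIDATION_START + timedelta(weeks=16)


def split_for(first_seen: date) -> Optional[str]:
    """Return the split a sample belongs to, or None if it is out of range."""
    if TRAIN_START <= first_seen < VALIDATION_START:
        return "training"
    if VALIDATION_START <= first_seen < TEST_START:
        return "validation"
    if TEST_START <= first_seen < TEST_END:
        return "test"
    return None
```

Splitting by time rather than at random means the model is always evaluated on files newer than anything it was trained on, which is closer to how a malware detector is used in practice.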
Architecture
Executable as sequence of N bytes
→ Fixed Embedding: 8 × N
→ Conv 32 (stride 4): 48 × N/4
→ Conv 32 (stride 4): 96 × N/16
→ Max pooling 4: 96 × N/64
→ Conv 16 (stride 8): 128 × N/512
→ Conv 16 (stride 8): 192 × N/4096 (4096 = 4·4·4·8·8)
→ Global Average: 192
→ Fully Connected: 192
→ Fully Connected: 160
→ Fully Connected: 128
→ Fully Connected: 2
• Fixed embedding: byte → (±1/16, ..., ±1/16) ∈ R^8 according to the byte’s bits
• Power-of-two strides improve speed and performance; strides 3, 5, 7, 9 instead of 4, 4, 8, 8 harm performance by 6-10%
• Global average projects a variably-wide matrix to a fixed-size vector
• Training: 7-fold influence of clean files on the loss, mild weight decay, Adam, ...
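The fixed embedding can be sketched as a function that maps each byte to eight values of ±1/16, one per bit. The bit order (most significant bit first) and the sign convention (set bit maps to +1/16) are assumptions; the slides only say the vector follows the byte's bits. The last constant restates the overall downsampling factor of the strided layers.

```python
from typing import List


def embed_byte(value: int) -> List[float]:
    """Map a byte to (+-1/16, ..., +-1/16) in R^8, one entry per bit.

    Assumed convention: most significant bit first, set bit -> +1/16.
    """
    if not 0 <= value <= 255:
        raise ValueError("expected a byte value in 0..255")
    return [
        (1.0 if (value >> bit) & 1 else -1.0) / 16.0
        for bit in range(7, -1, -1)
    ]


def embed(data: bytes) -> List[List[float]]:
    """Embed a whole executable: N bytes -> N vectors of length 8."""
    return [embed_byte(b) for b in data]


# Combined downsampling of the strided layers: 4, 4, 4 (pooling), 8, 8.
TOTAL_STRIDE = 4 * 4 * 4 * 8 * 8
```

Because the embedding is fixed rather than learned, it adds no trainable parameters, and a file of N bytes ends up as roughly N / TOTAL_STRIDE positions before global averaging.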
Evaluation
• Evaluation in the regime of low false positives (formally, the
area under the receiver operating characteristic (ROC) curve restricted to
false positive rates in [0, 0.001] – AUC|<0.001)
(Figure: ROC curve; x-axis False Positive Rate, y-axis True Positive Rate)
• Evaluation score matters: Global Average (instead of Max),
strong emphasis on clean files, ...
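The restricted AUC can be sketched in plain Python: build the ROC curve from scores, integrate it only up to the maximum false positive rate, and divide by that rate so a perfect classifier scores 1.0. The normalisation is an assumption chosen to match the scale of the reported values; the function name and arguments are hypothetical.

```python
from typing import List, Sequence, Tuple


def partial_auc(
    labels: Sequence[int],      # 1 = malware, 0 = clean
    scores: Sequence[float],    # higher = more likely malware
    max_fpr: float = 0.001,
) -> float:
    """Area under the ROC curve for FPR in [0, max_fpr], divided by max_fpr."""
    pos = sum(1 for y in labels if y == 1)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need both positive and negative samples")
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])
    points: List[Tuple[float, float]] = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(pairs):
        j = i
        # Samples with equal scores share one threshold.
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            if pairs[j][1] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        points.append((fp / neg, tp / pos))
        i = j
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 >= max_fpr:
            break
        if x1 > max_fpr:
            # Cut the segment at max_fpr by linear interpolation.
            y1 = y0 + (y1 - y0) * (max_fpr - x0) / (x1 - x0)
            x1 = max_fpr
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area / max_fpr
```

Restricting the integral to very low false positive rates reflects the constraint that an antivirus product can afford to flag only a tiny share of clean files.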
Competing architecture (dataset matters)
“MalConv” convnet by Raff et al. (Univ. Maryland + NVIDIA)
Executable as sequence of N bytes
→ Embedding: 8 × N
→ Gated Conv 512 (stride 512): 128 × (N/512)
→ Global Max: 128
→ Fully Connected: 128
→ Fully Connected: 2
AUC[0,0.001]
Our architecture 0.704 ± 0.005
MalConv (competitor) 0.661 ± 0.009
Automatic vs. hand-crafted features
550 hand-crafted in-house features (last year @MLP)
fed to a 5-layer feedforward net (same set of samples):
AUC|<0.001
convolution features 70.4 ± 0.5%
hand-crafted features 73.2 ± 2.3% (similar in accuracy and cross-entropy)
ensembled features 76.1 ± 1.0% (much better accuracy and cross-entropy)
Convnets are slightly below Avast’s know-how, but already good
at feature enrichment
– Dataset easier for convnets
+ Improvement potential, transferable to other domains
Explainability
• grad-CAM (Class Activation Map):
which “pixels” of the last conv layer
caused the prediction?
Byte-level explanations:
Guided Backprop
• header of an embedded PE
• “VERSION_INFO” with a fake vendor and software name
• unusual imported functions
Future work
• Improve speed (separable convolutions, mixture of experts)
• More diverse and larger dataset
• Apply to (other file types relevant for) network traffic
Questions?
Convolutional Nets
input aligned in Euclidean space
1D (sequence): x1 ... xn
[Figure: a convolutional layer over the sequence x1 ... xn, with the weights θ1 ... θ3 repeated along the input]
convenient for building in some invariance to translation
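A small sketch of a 1D convolutional layer, assuming a single filter with three shared weights (standing in for θ1 ... θ3) and no padding. Because the same weights are applied at every position, shifting the input shifts the output by the same amount, and a global max over the output is then unchanged by the shift. The input values and weights are made up.

```python
def conv1d(x, theta):
    """Slide one shared filter over the sequence (no padding, stride 1)."""
    k = len(theta)
    return [sum(theta[j] * x[i + j] for j in range(k))
            for i in range(len(x) - k + 1)]


theta = [1, 0, -1]             # shared weights, reused at every position
x = [0, 0, 1, 2, 0, 0, 0]      # a pattern early in the sequence
x_shifted = [0, 0, 0, 1, 2, 0, 0]  # the same pattern one step later

out = conv1d(x, theta)
out_shifted = conv1d(x_shifted, theta)

# The response moves with the pattern (translation equivariance) ...
same_response_shifted = out_shifted[1:] == out[:-1]
# ... and a global max pool no longer depends on where the pattern was.
same_pooled_value = max(out) == max(out_shifted)
```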

Avast @ Machine Learning

  • 1. Avast @ Machine Learning Prague - HANDOUT S in IoT stands for Security
  • 2. Our team: IoT and ML Research department 2 Galina Adam Tomáš Marek Martin Vláďa Martin
  • 3. Avast in numbers Around 2000 employees 435 million users Users in over 150 countries Protecting from 3.5 billion attacks per month Blocked 128 million ransomware attacks in 2016 Our engines check 200 billion URLs and 300 million new files monthly 3
  • 5. • Number of IoT devices is on the rise, expected to have 75 billion of connected things by 2025 • IP Cameras • Network attached storages • Thermostats • Smart speakers • Digital personal assistants • … you name it, the IoT world will have it Future: The IoT world 5
  • 6. • IoT products: • convenience • usability • not necessarily to be easily secured • They can be compromised in many ways • Spy on users • Blackmail users • Gain physical access to the home • Misuse of devices for • Attacking third-party services • Misuse of computational power Future: Securing the IoT world 6
  • 7. • Avast is developing a AI-based protection for IoT • The key to protecting IoT devices: Cloud-based security that monitors for threats at the network level Avast Smart Life - a new product coming 7
  • 8. 09:00 - 09:30 Yin and Yang of IoT 09:30 - 10:00 A Case study: Mirai attack vector 10:00 - 10:30 Machine learning algorithms and feature engineering 10:30 - 11:00 Coffee Break 11:00 - 11:30 Neural networks for classification of binary files 11:30 - 12:00 Identifying devices within the network 12:00 - 12:30 Phishing prevention and blocking of malicious URLs Workshop structure 8
  • 10. Botnet Set of enslaved devices (usually IoT) that can be controlled by a cybercriminal OR Malware that enslaves the devices 10
  • 11. But why? Can be used to • Gain computational power • Perform distributed denial-of-service attack (DDoS attack) • Send spam • Mine Cryptocurrencies • Steal data DDoS Business model • Harm&Destroy your competitors and opponents • Business competitors • Political opponents, Independent journalism • Sell DDoS as a service • Blackmail companies (Money or DDoS!) 11
  • 12. Botnet components • Zombie computer • A compromised node infected by the botnet malware • Command and Control (CnC) server • Server that remotely controls the zombie computers • Botmaster • A person who operates the CnC server • Hides their identity (via Tor, proxies, …) 12
  • 14. A case study: Mirai • Attacks vulnerable IoT devices with factory-default credentials • IP cameras • Network storages • Client - Server architecture • Two main component are the Mirai itself and a C&C server • Both available on Github • Spreads like a worm over the internet: each infected node scans the whole IPv4 • Does not attack DoD of USA • Not persistent - device restarts and mirai disappears (and comes soon again - C&C has memory) 14
  • 15. A case study: Mirai • October 2016: One of the most impactful cyber attacks ever against online infrastructure firm Dyn impacting Twitter, Spotify, Reddit, Airbnb, Netflix,... • Also hit and disabled krebsonsecurity.com just hours after Krebs presented a talk on Mirai at a conference • Solution: Google’s Project Shield protection. • Attacks with power of 600 - 1500 Gb/s • 150 000 enslaved devices make for 1Tb of DDoS capability 15
  • 16. Mirai: Vulnerable device setup • Telnet running (big mistake, but still very popular) • Forwards its port 23 to the router (i.e. port 23 is visible from the outside of the network) 16
  • 17. Mirai: Dictionary Attack • Knock knock, will you let me in? • Mirai has a predefined dictionary of factory-setting credentials, tries them randomly • Sends the commands via telnet (plaintext) • 17
  • 18. Mirai: Dictionary Attack • Knock knock, will you let me in? • Mirai has a predefined dictionary of factory-setting credentials, tries them randomly • Sends the commands via telnet (plaintext) 18
  • 19. Mirai: Password guessed It notifies the C&C server which sends a telnet command to download the mirai binary (via wget or tftp) 19
  • 20. Mirai: Mass Scan • Scans the internet and tries to infect vulnerable devices in other networks • Mirai is quite a stupid parasite: even sometimes kills its host 20
  • 21. Mirai: DDoS When the command comes, this is the result: 21 A map of internet outages in Europe and North America caused by the Dyn cyberattack (as of 21 October 2016 1:45pm Pacific Time).
  • 22. Mirai: Visualization of Activity Telnet & Mass Scan 22
  • 23. A case study: Mirai attack vector Summary of Mirai attack: • Uses vulnerable telnet • Has a list of factory settings (CHANGE YOUR PASSWORD) • Scans the whole range of the internet • Is actually very simple, nonetheless, it caused a lot of troubles in 2016 IoT Malware is very simple compared to PC malware, it will be evolving and gain complexity 23
  • 24. ML for security in the Networking context Martin Bálek
  • 25. Security of an endpoint • Block malicious servers / sites • Check traffic • Detect malicious content 25
  • 27. Our data domain • Stream of packets • Big amount of data • A lot of protocols (tcp, udp, ip, icmp, upnp, telnet, ftp, http, https…) • Deep vs. stateful packet inspection 27
  • 28. 28
  • 29. Our data domain • Flows • Source IP address • Source Port • Destination IP address • Destination Port • Protocol • + some useful stats 29
  • 31. 31 • Detect devices which are present in network • Find communication patterns • Detect malicious behavior • Make all above robust Challenges
  • 32. Data from ML point of view • Features • Are available at any time? (e.g. total amount of data) • How cheap is to calculate them? (e.g. packet interarrival time - mean, variance based) • Is the information relevant? (e.g. port) • Time series... but begin and end are possibly more important • Single packet/flow is never important (or is it?) • Traffic is not deterministic (e.g. connection issues, port or IP changes a lot) • All mixed together • A lot of legitimate scenarios (e.g. torrent vs. mass scan of the network) • Are we always sure what we are modeling? 32
  • 33. ZOO of algorithms • Unsupervised techniques • Anomaly detection • Communication patterns detection • Semisupervised methods • Device identification • Classification • Content checking • Patterns classification 33
  • 35. Task description ● Device identification: automatically identify device type and device model based on available device networking information 35
  • 36. 36
  • 37. Why this task is important ● “Security in IoT” expects ability to distinguish IoT devices from other devices => device identification ● Device type and device model are important features in malware detection 37
  • 38. Existing approach ● Expert-created rules based on device features and regexps ● Advantages: ○ Utilize expert knowledge and many different features ○ Accurate upon exact match ● Disadvantages: ○ Missclassifications for too broad rules ○ Conflicting rules ○ Unknown accuracy 38
  • 39. How we want to improve it ● More accurate model ● Can solve conflicts between “rules” automatically ● Ability to generalise ● Ability to tune every source of properties and measure the accuracy ● Level of confidence (probability) together with an answer 39
  • 40. Device identification as classification ● Classification task where classes = different device types (~20 classes like phone, security camera, printer, computer, bulb, fridge) ● More detailed task: classes = different models of the device class (thousands of classes) ● Features: ○ Scan features (MAC-address, open ports, text body of specific responses) ○ Behavioral features (patterns in traffic consumption) 40
  • 41. Example of data ports MAC Vendor Protocol Response Label [ 67, 1900, 2869 ] 4c:0b:be:43:19:28 Microsoft DHCP hostname: XboxOnenclassid: MSFT 5.0nparamlist: 1,3,6,15,31,33,43,44,46,47,121, 249,252 Game console [ 80, 443, 1900] e0:88:5d:8f:f9:11 Technico UPNP '<?xml version="1.0"?>n<root xmlns="urn:schemas-upnp-org:device-1-0">n<specVersion>n<majo r>1</major>n<minor>0</minor>n</specVersion>n<device>n<devic eType>urn:schemas-upnp-org:device:InternetGatewayDevice:1</devi ceType>n<friendlyName>ZyXEL Keenetic II</friendlyName>n<manufacturer>ZyXEL Communications Corp.</manufacturer>n<manufacturerURL>ht………. Router [ 88, 443, 554, 1900, 5353 ] 00:0e:53:15:1d:ab AvTech HTTPS '<html>n<head>n<link rel="icon" href="/nobody/favicon.ico" type="image/vnd.microsoft.icon" />n<link rel="shortcut icon" href="/nobody/favicon.ico" type="image/vnd.microsoft.icon" />n<link rel="bookmark" href="/nobody/favicon.ico" type="image/vnd.microsoft.icon" />n<meta http-equiv="Content-Type" content="text/html; charset=utf-8">n<meta name="googlebot" content="nosnippet">n<meta name="robots" content="noarchive">n<title>::: Login :::</title>n<style>n<!--nbody {background-image: url(/nobody/jpg/bg.jpg); margin-left: 0px;margin-top: 0px;margin-right: 0px;margin-bottom: 0px;}ntd { font-size:14px;color:#FFFFFF;font-weight:bold; font-famil Security camera 41
  • 42. Challenges ● Unbalanced dataset (could be changed in future) ● It’s hard to obtain ground truth labels (expert knowledge) ● Different categories of features (numerical, categorical, text) ● Missing values - some devices don’t have specific features at all, only a subset ○ Empty ports ○ Empty response strings ○ Randomized MAC-address 42
  • 44. [Diagram] One fixed dataset with all features (ports, MAC, DHCP response, etc.) → Preprocessing for Classifier #1 … #N → Classifier #1 … #N, each producing p_1, …, p_n → Ensemble classifier → Label
    p_1, …, p_n = probabilities for device_class_1, …, device_class_n
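The slides do not say how the ensemble classifier combines the per-classifier probability vectors. As a minimal sketch, assuming a plain weighted average (the function name `ensemble_predict`, the weights, and the example numbers are all illustrative, not taken from the talk), the last step of the diagram could look like this:

```python
def ensemble_predict(probability_vectors, weights=None):
    """Combine per-classifier class probabilities by a weighted average.

    probability_vectors: one list [p_1, ..., p_n] per component classifier.
    weights: optional per-classifier weights (assumed; defaults to equal weights).
    Returns (index of the winning class, combined probabilities).
    """
    if weights is None:
        weights = [1.0] * len(probability_vectors)
    total = sum(weights)
    n_classes = len(probability_vectors[0])
    combined = [
        sum(w * p[k] for w, p in zip(weights, probability_vectors)) / total
        for k in range(n_classes)
    ]
    best = max(range(n_classes), key=lambda k: combined[k])
    return best, combined


# Three hypothetical component classifiers voting over three device classes.
label, probs = ensemble_predict([
    [0.7, 0.2, 0.1],
    [0.5, 0.4, 0.1],
    [0.2, 0.6, 0.2],
])
```

Keeping the per-classifier vectors around is what makes it possible to backtrack from the final label to the component that drove it.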
  • 45. Advantages of ensembling
    ● More accurate than individual classifiers
    ● Each individual classifier is responsible for specific features
    ● You can tune an individual classifier and see the change in accuracy
    ● Explainable
    ● Able to backtrack
  • 47. Silver classifier on MAC and open ports without labels
    ● Given an unlabelled dataset, how can you quickly get a labelled one from it?
    [Figure: X with unknown labels → X with labels y]
  • 48. Semi-supervised iterative learning
    1. Cluster one-hot encoded ports and vendor features
    2. An expert labels some of the clusters
    3. Collect the labelled devices and call them silver labels
    4. Train a classifier on the silver labels
    5. Run the classifier on the whole dataset
    6. Repeat everything from step 1 for the unclassified examples (unclassified = low probability)
    [Diagram: Cluster → Manually label some clusters → Train classifier on labelled → Run classifier on all data → Take unclassified → back to Cluster]
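Steps 5 and 6 hinge on deciding which examples count as unclassified. A minimal sketch of that split, assuming a fixed confidence threshold (the value 0.9 and the name `split_by_confidence` are illustrative assumptions; the talk gives no threshold):

```python
def split_by_confidence(probabilities, threshold=0.9):
    """Split classifier outputs into confident predictions and leftovers.

    probabilities: one list of class probabilities per device.
    threshold: assumed cut-off below which a device stays unclassified.
    Returns ({device index: predicted class}, [unclassified device indices]).
    """
    classified = {}
    unclassified = []
    for index, class_probs in enumerate(probabilities):
        best_class = max(range(len(class_probs)), key=lambda k: class_probs[k])
        if class_probs[best_class] >= threshold:
            classified[index] = best_class
        else:
            unclassified.append(index)
    return classified, unclassified


# Hypothetical classifier outputs for three devices and two classes.
predictions = [[0.95, 0.05], [0.55, 0.45], [0.02, 0.98]]
classified, unclassified = split_by_confidence(predictions)
coverage = len(classified) / len(predictions)
```

The `unclassified` indices are what would be fed back into clustering in the next iteration, and `coverage` corresponds to the share of devices the classifier is able to classify.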
  • 49. Classification coverage after 3 iterations
    [Figure: coverage levels 0%, 30%, 77%; 2 hours of labelling in total]
  • 50. Clustering
    ● ~1500 one-hot-encoded features
    ● Dataset biased towards rare classes (undersampling)
    ● Jaccard distance as the similarity measure
    [Figure: binary feature vector, ports followed by vendors]
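For binary one-hot vectors, the Jaccard distance is one minus the ratio of shared active features to all active features. The sketch below illustrates it on a tiny, made-up port vocabulary (`PORTS` and the two example devices are assumptions for illustration; the real feature space has ~1500 dimensions):

```python
def jaccard_distance(a, b):
    """Jaccard distance between two equal-length binary vectors."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    if either == 0:
        return 0.0  # assumption: two all-zero vectors are treated as identical
    return 1.0 - both / either


PORTS = [80, 443, 554, 1900]  # hypothetical, tiny port vocabulary


def one_hot_ports(open_ports):
    """Encode a set of open ports as a binary vector over PORTS."""
    return [1 if port in open_ports else 0 for port in PORTS]


camera = one_hot_ports({80, 443, 554})
router = one_hot_ports({80, 443, 1900})
distance = jaccard_distance(camera, router)  # 2 shared of 4 active -> 0.5
```

Unlike Euclidean distance, this measure ignores the many positions where both vectors are zero, which suits sparse one-hot data.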
  • 51. Results of Silver classifier
    ● Accuracy on test set: 99%
    ● Able to classify: 77%
    ● Agreement with rule-based labels: 55%
    ● Contains initial silver labels: 30%
  • 53. Components based on protocol responses
    ● DHCP: info about operating system
    ● zeroconf and UPnP: info about available services
    ● HTTP: info about user agent, server, location, admin interface
    ● General scheme:
      ○ Prepare xml/html/text documents
      ○ Text extraction based on heuristics specific to each protocol
      ○ Estimate probabilities of a class given a word, e.g. P(security camera | “cgi”) = 0.81
      ○ Cluster words that often occur in the same documents
      ○ For a new device, find the closest cluster
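The class-given-word estimate in the scheme above can be sketched as a simple document-frequency ratio. This is an assumed estimator (the talk does not state how the probabilities are computed), and the four toy documents are invented for illustration:

```python
from collections import Counter, defaultdict


def class_given_word(documents):
    """Estimate P(class | word) from labelled documents.

    documents: iterable of (words, label) pairs, one per device response.
    Each word is counted once per document (document frequency).
    """
    word_counts = Counter()
    joint_counts = defaultdict(Counter)
    for words, label in documents:
        for word in set(words):
            word_counts[word] += 1
            joint_counts[word][label] += 1
    return {
        word: {label: n / word_counts[word] for label, n in labels.items()}
        for word, labels in joint_counts.items()
    }


# Hypothetical extracted words from four device responses.
docs = [
    ({"cgi", "login"}, "security camera"),
    ({"cgi", "stream"}, "security camera"),
    ({"cgi", "gateway"}, "router"),
    ({"gateway", "login"}, "router"),
]
table = class_given_word(docs)
p_camera_given_cgi = table["cgi"]["security camera"]  # 2 of 3 documents
```

With real data, rare words would likely need smoothing or a minimum-count cut-off before such ratios are trusted.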
  • 59. Time-series based learning
    ● Number of events in time
    ● Any continuous or discrete value
      ○ E.g. number of flows or number of unique destinations
    ● Any time interval
    ● Challenges
      ○ Robust to different behaviors
      ○ Robust to legitimate anomalies
      ○ Robust to devices being turned on/off
  • 60. Dynamic Time Warping
    ● Algorithm for measuring the similarity of two time series
    ● Calculates the distance between the series
    ● Output ∈ [0, +∞)
      ○ The closer the output is to zero, the more similar the series
    ● O(N²) complexity
    ● The time series can have different lengths
    [Figure: example series 0 1 1 0 1]
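A minimal sketch of the classic dynamic-programming formulation of DTW, assuming absolute difference as the local cost and no window constraint (the talk does not specify which variant is used):

```python
def dtw_distance(a, b):
    """Dynamic Time Warping distance between two numeric sequences.

    Uses |x - y| as the local cost and fills an (n+1) x (m+1) table,
    so time and memory are O(n * m).
    """
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: step in a, step in b, step in both.
            cost[i][j] = local + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


# Series of different lengths that differ only by a repeated value.
same_shape = dtw_distance([0, 1, 1, 0, 1], [0, 1, 0, 1])
different = dtw_distance([0, 0, 0], [1, 1, 1])
```

Here `same_shape` is zero because warping absorbs the repeated 1, while `different` accumulates a cost at every aligned step.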
  • 61. Dynamic Time Warping example
    [Figure: two example pairs of time series, with DTW distances 0.2 and 0.6]
  • 62. One component: Time based classifier
    ● Time series for each device
    ● Similarities of behaviors
    ● Classification of devices
  • 63. Time for classifier’s decision
    ● Each component needs some time to produce a decision about a device
    ● The time is different for each classifier
    ● Precision vs. time
    ● Silver classifier
      ○ seconds to minutes
    ● Protocol response classifiers
      ○ minutes to hours
    ● Time-series classifier
      ○ minutes to hours
    [Figure: timeline of Silver classifier, HTTP response, DHCP response, Time-series analysis]
  • 65. There is only one legitimate site!
  • 66. There is only one Amazon!
    Target: https://www.amazon.com/ap/signin
    Attacker: http://secure-login-verify-myaccount-information.ga/signin/
  • 67. Schema of processing
    URL filtering → Image acquisition → Key point detection → Matching → Evaluation & verification
  • 68. Catching the phishers
    - Challenge
      - decide as early as possible
      - we’re able to crawl and screenshot only a fraction of the URLs
    - Solution
      - discard most of the URLs with reasonable confidence
      - whitelist + whitelist exceptions
      - string matches on fishy keywords (approximate)
      - prefer likely distribution channels (email, Facebook)
  • 69. Catching the phishers
    - Whitelist
      - We use the 1 million most popular top private domains (TPDs), collected over several months
    - Whitelist exceptions
      - but some popular sites are evil
      - additionally, we collected popular TPDs that can easily host user content, and we ‘subtract’ them from the list
      - e.g. google.com but not sites.google.com
  • 70. Catching the phishers
    - Approximate matching
      - tokens from target sites
      - adobe, alibaba, aliexpress, coinsbank, credit-suisse
    - Exact match (strings shorter than 5)
      - often misused constructs (paypal.com-blabla.gd)
      - .com-, .org., cgi-bin
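The filtering rules from the last three slides can be sketched as one pre-filter function. Everything below is a simplified assumption: the two-entry whitelist stands in for the 1 million TPDs, suffix matching stands in for real TPD extraction, and the similarity threshold of 0.8 with `difflib.SequenceMatcher` is an illustrative choice for the approximate matching, not the method used in the talk.

```python
from difflib import SequenceMatcher
from urllib.parse import urlparse

WHITELIST = {"google.com", "amazon.com"}       # stand-in for the 1M popular TPDs
WHITELIST_EXCEPTIONS = {"sites.google.com"}    # hosts that serve user content
TARGET_TOKENS = ["adobe", "alibaba", "aliexpress", "coinsbank", "credit-suisse"]
EXACT_PATTERNS = [".com-", ".org.", "cgi-bin"]


def matches_domain(host, domains):
    """True if host equals one of the domains or is a subdomain of one."""
    return any(host == d or host.endswith("." + d) for d in domains)


def is_whitelisted(host):
    """Whitelist lookup in which the exceptions take precedence."""
    if matches_domain(host, WHITELIST_EXCEPTIONS):
        return False
    return matches_domain(host, WHITELIST)


def contains_approximately(token, text, min_ratio=0.8):
    """Slide a token-sized window over text and look for a near match."""
    width = len(token)
    windows = (text[i:i + width] for i in range(max(1, len(text) - width + 1)))
    return any(SequenceMatcher(None, token, w).ratio() >= min_ratio for w in windows)


def is_suspicious(url):
    """Decide whether a URL is worth crawling and screenshotting."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if is_whitelisted(host):
        return False
    text = host + parsed.path.lower()
    if any(pattern in text for pattern in EXACT_PATTERNS):
        return True
    return any(contains_approximately(token, text) for token in TARGET_TOKENS)
```

For example, `is_suspicious("http://paypal.com-blabla.gd/")` is true because of the `.com-` construct, while the legitimate Amazon sign-in URL is discarded by the whitelist.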
  • 72. How to match two images? Unpopular, but looks similar? Fishy!
  • 74. Detecting the points
    • Interesting points: points, edges, blobs
    • Many methods exist for detecting the interesting points in a picture
      - Corner detectors, edge detectors
    • Examples of the points
  • 75. Describing the points
    • Abstract (mathematical) representation of the detected points
    • Extract the patches
    • FAST, SIFT, ORB, ...
  • 77. Matching and verification
    - Find the matrix transforming the points from one picture to the other
    - A transformation hypothesis is determined by 4 points; we sample them and estimate the transformation with RANdom SAmple Consensus (RANSAC)
    Credits: Scikit-learn
  • 78. Matching and verification
    - Evaluate how reasonable the transformation is
      - valid perspective transformation
      - thresholds on scale, rotation, shear, translation
    [Figure: corner points 1 to 4 of a rectangle before and after transformation]
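One way to apply such thresholds is to split the estimated 3×3 matrix into rotation, scale, and shear and check each against a limit. The decomposition below (rotation · shear · scale of the upper-left 2×2 block) is a standard one, but every threshold value and the function names are assumptions for illustration; the talk gives no numbers.

```python
import math


def decompose_affine(a, b, c, d):
    """Split [[a, b], [c, d]] into rotation * shear * scale.

    Returns (scale_x, scale_y, rotation in radians, shear factor).
    Assumes a positive determinant (checked by the caller).
    """
    det = a * d - b * c
    scale_x = math.hypot(a, c)
    scale_y = det / scale_x
    rotation = math.atan2(c, a)
    shear = (a * b + c * d) / det
    return scale_x, scale_y, rotation, shear


def is_plausible(h, max_scale=2.0, max_rotation=math.radians(10),
                 max_shear=0.1, max_translation=200.0, max_perspective=1e-3):
    """Check a 3x3 transformation (nested lists) against assumed thresholds."""
    (a, b, tx), (c, d, ty), (p, q, w) = h
    if w == 0:
        return False
    a, b, tx, c, d, ty, p, q = (v / w for v in (a, b, tx, c, d, ty, p, q))
    if a * d - b * c <= 0:
        return False  # mirrored or degenerate mapping
    scale_x, scale_y, rotation, shear = decompose_affine(a, b, c, d)
    return (
        1 / max_scale <= scale_x <= max_scale
        and 1 / max_scale <= scale_y <= max_scale
        and abs(rotation) <= max_rotation
        and abs(shear) <= max_shear
        and abs(tx) <= max_translation
        and abs(ty) <= max_translation
        and abs(p) <= max_perspective
        and abs(q) <= max_perspective
    )


identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
mirrored = [[-1, 0, 0], [0, 1, 0], [0, 0, 1]]
```

A screenshot of a phishing page is typically a scaled and shifted copy of the target, so a match that needs a large rotation, a mirror flip, or strong perspective distortion is more likely to be a coincidence of keypoints.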
  • 80. Problems
    - So far, the only problem is (small) text blocks
      - Text generates a lot of key points (high local contrast)
      - A text block matches any other text block (the descriptors are all similar)
    - Possible solutions
      - ignore images with many key points
      - detect text through high density in a single SIFT octave
      - use an external text detector to mask areas
  • 81. Conclusion & Future work
    - Persuade the phishers that we are accessing their sites from different locations
    - More precise but still fast text detection mechanism
    - Make it even faster ;)
  • 84. Deep Convolutional Malware Classifiers Can Learn from Raw Executables and Labels Only
    Marek Krčál, Avast fellow at Institute of Computer Science
    Joint work with Ondřej Švec, Martin Bálek, and Otakar Jašek
    • Can the success of convnets (incl. end-to-end learning) be transferred to malware detection?
    • Applied to Windows executables, but rather domain-agnostic: it could cover other file formats and blocking their content in network traffic
  • 89. Convnets achieve state-of-the-art in many areas
    [Figure: example images labelled Husky, Lilly, Eskymo, Spiral, Coffee, ...]
    • without feature engineering!
    • explainability: why boxer?
  • 96. Detection of Malicious Portable Executables
    • Input is 1D – sequence of bytes (static malware analysis)
      – Consists of header, sections, relocation tables, strings – no global semantics of byte symbols
      – Each can appear at an almost arbitrary place – high translational variance
    • We use no domain expertise aside from labels
    • Only two classes: clean and malware (for simplicity)
  • 100. Dataset – 20 million executables
    • Fewer files with compressed/encrypted machine code
    • File sizes between 12 kB and 0.5 MB
    • Temporal split: the timeline markers 1.1.2016, 1.1.2017, week 8, and week 16 delimit the training, validation, and test sets, in that order
    • Roughly balanced clean and malware classes
  • 109. Architecture
    Input: executable as a sequence of N bytes
    Layers in order, with output shapes:
      Fixed Embedding: 8 × N
      Conv 32 (stride 4): 48 × N/4
      Conv 32 (stride 4): 96 × N/16
      Max pooling 4: 96 × N/64
      Conv 16 (stride 8): 128 × N/512
      Conv 16 (stride 8): 192 × N/(4·4·4·8·8) = 192 × N/4096
      Global Average: 192
      Fully Connected: 192
      Fully Connected: 160
      Fully Connected: 128
      Fully Connected: 2
    Notes:
      • Fixed embedding: byte → (±1/16, …, ±1/16) ∈ R^8 according to the byte’s bits
      • Power-of-two strides improve speed and performance; strides 3, 5, 7, 9 instead of 4, 4, 8, 8 harm performance by 6-10%
      • Global Average projects a variably-wide matrix to a fixed-sized vector
      • Training: 7-fold influence of clean on loss, mild weight decay, Adam, ...
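The fixed embedding and the length reduction through the network can be checked with a few lines of arithmetic. In this sketch the bit order (least significant bit first) and the helper names are assumptions; the slide only says that each of the 8 coordinates is ±1/16 according to the byte’s bits.

```python
def embed_byte(byte):
    """Map a byte (0..255) to a vector in R^8 with entries +1/16 or -1/16.

    Bit order is an assumption: coordinate k reflects bit k, LSB first.
    """
    return [(1 if (byte >> bit) & 1 else -1) / 16 for bit in range(8)]


def cumulative_downsampling(strides):
    """Running product of strides: the factor by which N has shrunk per layer."""
    factors = []
    total = 1
    for stride in strides:
        total *= stride
        factors.append(total)
    return factors


# Strides of the two convolutions, the max pooling, and the last two convolutions.
factors = cumulative_downsampling([4, 4, 4, 8, 8])

# Sequence lengths per layer for a hypothetical 512 kB input.
n_bytes = 512 * 1024
lengths = [n_bytes // f for f in factors]
```

The factors reproduce the denominators 4, 16, 64, 512, and 4096 from the layer list, so a 512 kB file is reduced to 128 positions before global averaging.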
  • 112. Evaluation
    • Evaluation in the regime of low false positives (formally, the area under the receiver operating characteristic curve restricted to false positive rates in [0, 0.001] – AUC|<0.001)
    [Figure: ROC curve, False Positive Rate vs. True Positive Rate]
    • Evaluation score matters: Global Average (instead of Max), strong emphasis on clean files, ...
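A restricted AUC of this kind can be sketched as trapezoidal integration of the ROC curve up to a maximum false positive rate. Dividing by that maximum, so that a perfect classifier scores 1, is an assumption made here to match the 0-to-1 range of the reported scores; the function name and toy data are illustrative.

```python
def partial_auc(labels, scores, max_fpr=0.001):
    """Area under the ROC curve for FPR in [0, max_fpr], divided by max_fpr.

    labels: 1 for malware (positive), 0 for clean (negative).
    scores: higher means more likely malware.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    order = sorted(range(len(labels)), key=lambda i: -scores[i])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(order):
        # Lower the threshold past all samples that share the same score.
        j = i
        while j < len(order) and scores[order[j]] == scores[order[i]]:
            if labels[order[j]] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        points.append((fp / negatives, tp / positives))
        i = j
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 >= max_fpr:
            break
        if x1 > max_fpr:
            # Cut the segment at max_fpr by linear interpolation.
            y1 = y0 + (y1 - y0) * (max_fpr - x0) / (x1 - x0)
            x1 = max_fpr
        area += (x1 - x0) * (y0 + y1) / 2
    return area / max_fpr


# Toy data: a perfect ranking and a fully inverted one.
perfect = partial_auc([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1], max_fpr=0.5)
inverted = partial_auc([0, 0, 1, 1], [0.9, 0.8, 0.2, 0.1], max_fpr=0.5)
```

With a limit as small as 0.001, the score is only meaningful on test sets with many thousands of clean files, since each false positive moves the curve by 1/negatives along the x-axis.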
  • 115. Competing architecture (dataset matters)
    “MalConv” convnet by Raff et al. (Univ. Maryland + NVIDIA)
    MalConv layers with output shapes: Embedding (8 × N) → Gated Conv 512 (stride 512) (128 × N/512) → Global Max → Fully Connected (128) → Fully Connected (2)
    Architecture             AUC[0,0.001]
    Our architecture         0.704 ± 0.005
    MalConv (competitor)     0.661 ± 0.009
  • 122. Automatic vs. hand-crafted features
    550 hand-crafted in-house features (last year @MLP), fed to a 5-layer feedforward net (same set of samples):
    Features                 AUC|<0.001
    convolution features     70.4 ± 0.5%
    hand-crafted features    73.2 ± 2.3%   (similar in accuracy and cross-entropy)
    ensembled features       76.1 ± 1.0%   (much better accuracy and cross-entropy)
    Convnets are slightly below Avast’s know-how, but already good at feature enrichment
    – Dataset easier for convnets
    + Improvement potential, transferable to other domains
  • 124. Explainability • grad-CAM (Class Activation Map): which “pixels” of the last conv layer caused the prediction?
  • 130. Byte-level explanations: Guided Backprop
    • header of an embedded PE
    • “VERSION_INFO” with a fake vendor and software name
    • unusual imported functions
  • 134. Future work
    • Improve speed (separable convolutions, mixture of experts)
    • More diverse and larger dataset
    • Apply to (other file types relevant for) network traffic
    Questions?
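  • 142. Convolutional Nets
    • Input aligned in Euclidean space, e.g. 1D (sequence) x1, ..., xn
    • Convolutional layer: the same weights θ1, ..., θ3 are applied at every position
    • Convenient for building in some invariance to translation
    [Figure: 1D input x1 ... xn with the weights θ1 ... θ3 repeated along the sequence]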
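The effect of sharing weights across positions can be seen in a small 1D example: shifting the input shifts the output by the same amount, and a global pooling step afterwards removes the dependence on position. The kernel values and the signal below are made up for illustration.

```python
def conv1d(signal, kernel, stride=1):
    """Valid 1D convolution (cross-correlation) with shared weights."""
    width = len(kernel)
    return [
        sum(k * x for k, x in zip(kernel, signal[start:start + width]))
        for start in range(0, len(signal) - width + 1, stride)
    ]


kernel = [1.0, -2.0, 1.0]             # hypothetical weights θ1, θ2, θ3
signal = [0, 0, 1, 3, 1, 0, 0, 0]
shifted = [0] + signal[:-1]           # the same pattern, one step to the right

out = conv1d(signal, kernel)
out_shifted = conv1d(shifted, kernel)

# The response pattern moves with the input ...
same_pattern = out_shifted[1:] == out[:-1]
# ... and global max pooling gives the same value for both inputs.
same_pooled = max(out) == max(out_shifted)
```

This is the property that matters for executables, where the same structure can appear at an almost arbitrary offset in the file.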