Attacking Machine Learning used in AntiVirus with Reinforcement by Rubén Martínez at Big Data Spain 2017

Attacking Machine Learning used in
AntiVirus with RL
Datahack

# whoami Rubén Martínez Sánchez
• Twitter: @eldarsilver
• Computer Engineer (Universidad Politécnica Madrid)
• Security Researcher (Pentester)
• Certified Ethical Hacker (CEH)
• Member of MundoHacker (TV Show)
• Master Data Science Datahack
• Cloudera Developer Training for Apache Spark
• Cloudera Developer Training for Apache Hadoop

# ls() Agenda
• Static Malware Analysis
• Reinforcement Learning (RL)
• Antivirus Evasion using RL
• Demo Antivirus Evasion using RL

# cat Static_Malware_Analysis
• Definition

Static Malware Analysis is usually performed by dissecting the different
resources of the binary file without executing it and studying each component.
The binary file can also be disassembled (or reverse engineered) using a
disassembler such as IDA or radare. (Wikipedia)

Search for signatures in the executable.
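As a rough illustration of signature search, the sketch below scans a binary buffer for known byte patterns without executing it. The signature names and byte patterns are hypothetical placeholders, not real malware signatures, and real engines use far richer rule formats.

```python
from typing import Dict, List

# Hypothetical signature database: name -> byte pattern (illustrative values only).
SIGNATURES: Dict[str, bytes] = {
    "demo_marker_a": b"EVIL_MARKER",
    "demo_marker_b": bytes.fromhex("deadbeef"),
}


def scan_signatures(data: bytes, signatures: Dict[str, bytes] = SIGNATURES) -> List[str]:
    """Return the sorted names of all signatures whose pattern occurs in data."""
    return sorted(name for name, pattern in signatures.items() if pattern in data)


if __name__ == "__main__":
    sample = b"\x00\x01EVIL_MARKER\x02"
    print(scan_signatures(sample))
```

Because the match is a plain substring test, changing a single byte of the pattern is enough to defeat it, which is one reason static engines also rely on learned features.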

# cat Static_Malware_Analysis
• Portable Executable (PE)

The Portable Executable (PE) file format is a data structure that contains the
information necessary for the Windows OS loader to manage the wrapped
executable code.

PE File Format by Saurabh & Chinmaya
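To make the PE layout more concrete, here is a minimal sketch that reads a few header fields with Python's standard `struct` module: the `MZ` magic, the `e_lfanew` offset stored at 0x3C, the `PE\0\0` signature, and the 20-byte COFF file header that follows it. The function names and the `build_demo_image` helper are assumptions made for illustration; the helper only produces header bytes, not a loadable executable.

```python
import struct


def parse_pe_header(data: bytes) -> dict:
    """Read basic COFF header fields from the start of a PE image."""
    if len(data) < 0x40 or data[:2] != b"MZ":
        raise ValueError("missing MZ (DOS) header")
    # e_lfanew: offset of the PE signature, stored at 0x3C in the DOS header.
    (e_lfanew,) = struct.unpack_from("<I", data, 0x3C)
    if data[e_lfanew:e_lfanew + 4] != b"PE\x00\x00":
        raise ValueError("missing PE signature")
    if len(data) < e_lfanew + 24:
        raise ValueError("truncated COFF header")
    # COFF file header: 20 bytes immediately after the 4-byte signature.
    (machine, n_sections, timestamp, _symtab, _nsyms,
     opt_size, characteristics) = struct.unpack_from("<HHIIIHH", data, e_lfanew + 4)
    return {
        "machine": machine,
        "number_of_sections": n_sections,
        "timestamp": timestamp,
        "size_of_optional_header": opt_size,
        "characteristics": characteristics,
    }


def build_demo_image() -> bytes:
    """Hypothetical helper: header-only bytes used to exercise the parser."""
    dos = bytearray(0x40)
    dos[:2] = b"MZ"
    struct.pack_into("<I", dos, 0x3C, 0x40)  # PE signature right after DOS header
    coff = struct.pack("<HHIIIHH", 0x14C, 3, 0, 0, 0, 0xE0, 0x0102)
    return bytes(dos) + b"PE\x00\x00" + coff


if __name__ == "__main__":
    print(parse_pe_header(build_demo_image()))
```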

# cat Reinforcement_Learning
• Definitions

A Reinforcement Learning model consists of an agent and an environment.

For each turn $t$, the agent receives a state $s_t$ and may choose one action $a_t$ from a set of
actions $A$.

The policy $\pi$ is the agent’s behavior, i.e., a mapping from states to actions.

The agent then receives the next state $s_{t+1}$ and a scalar reward $r_t$.

http://www.ausy.tu-darmstadt.de/Research/Research

• Definitions

Immediate rewards are generally not very helpful while learning a game. So,
what we should aim for is long term rewards.

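The long term reward of step t will be:

$$R_t = \sum_{k=0}^{\infty} \gamma^{k} r_{t+k}$$

where $r_{t+k}$ is the reward received $k$ steps after step $t$.

The agent aims to maximize the expectation of such long term return from
each state.

The parameter $\gamma \in [0, 1)$ is the discount factor that defines the weight of distant
rewards in relation to those obtained sooner.

Discounting by $\gamma < 1$ ensures that this sum is finite as long as the rewards are bounded.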
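For a finite episode, the discounted long term reward described above can be computed by folding the reward list from the end. This is a minimal sketch; the function name and the default discount factor of 0.99 are assumptions for illustration.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float = 0.99) -> float:
    """Sum of gamma**k * rewards[k], accumulated from the last reward backwards."""
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


if __name__ == "__main__":
    # A reward of 10 that arrives two steps later is worth 10 * 0.5**2 now.
    print(discounted_return([0.0, 0.0, 10.0], gamma=0.5))
```

With a smaller discount factor the same late reward contributes less, so the agent is pushed toward shorter action sequences.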

• Q value

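The optimal action-value function (Q value) is the maximum expected long term reward $R_t$ achievable after taking action $a$ in state $s$:

$$Q^{*}(s, a) = \max_{\pi} \mathbb{E}\left[ R_t \mid s_t = s,\ a_t = a,\ \pi \right]$$

A Neural Network with weights $\theta$ will be used to approximate this function.

Next we can define the policy to choose an action:

$$\pi(s) = \arg\max_{a} Q^{*}(s, a)$$

The Loss function to update the Network, with discount factor $\gamma$ and next state $s'$:

$$L(\theta) = \left( r + \gamma \max_{a'} Q(s', a'; \theta) - Q(s, a; \theta) \right)^{2}$$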
http://web.stanford.edu/class/cs20si/lectures/slides_14.pdf
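The following sketch shows the same idea with a lookup table standing in for the neural network, which is a deliberate simplification: the greedy policy picks the action with the highest Q value, and the squared temporal-difference error plays the role of the loss. All function names, the `alpha` learning rate, and the `done` flag are assumptions for illustration.

```python
import random
from collections import defaultdict


def greedy_action(q, state, actions):
    """Policy: choose the action with the highest estimated Q value."""
    return max(actions, key=lambda a: q[(state, a)])


def epsilon_greedy(q, state, actions, epsilon, rng):
    """Explore with probability epsilon, otherwise act greedily."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return greedy_action(q, state, actions)


def td_target(q, reward, next_state, actions, gamma, done):
    """Bootstrapped target: r + gamma * max_a' Q(s', a'), or r at episode end."""
    if done:
        return reward
    return reward + gamma * max(q[(next_state, a)] for a in actions)


def td_loss(q, state, action, reward, next_state, actions, gamma, done):
    """Squared difference between the target and the current estimate."""
    target = td_target(q, reward, next_state, actions, gamma, done)
    return (target - q[(state, action)]) ** 2


def q_update(q, state, action, reward, next_state, actions, gamma, done, alpha):
    """Move Q(s, a) a fraction alpha of the way toward the target."""
    target = td_target(q, reward, next_state, actions, gamma, done)
    q[(state, action)] += alpha * (target - q[(state, action)])


if __name__ == "__main__":
    q = defaultdict(float)
    actions = ["a", "b"]
    q_update(q, "s0", "a", 10.0, "s1", actions, gamma=0.9, done=True, alpha=0.5)
    print(greedy_action(q, "s0", actions), epsilon_greedy(q, "s0", actions, 0.1, random.Random(0)))
```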

• Actor-Critic Algorithms

The actor produces an action given the current state of the environment.

The critic produces a TD (Temporal-Difference) error signal given the state and
resultant reward.

If the critic is estimating the action-value function Q(s,a), it will also need the
output of the actor.

The output of the critic drives learning in both the actor and the critic.

In Deep Reinforcement Learning, neural networks can be used to represent the
actor and critic structures.
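A minimal tabular sketch of the actor-critic interaction is given below. It assumes a softmax actor over per-state action preferences and a critic that estimates the state value V(s) rather than Q(s,a); lookup tables replace the neural networks. The class name, learning rates, and discount factor are assumptions for illustration.

```python
import math
from collections import defaultdict


class TabularActorCritic:
    def __init__(self, actions, gamma=0.9, actor_lr=0.1, critic_lr=0.1):
        self.actions = list(actions)
        self.gamma = gamma
        self.actor_lr = actor_lr
        self.critic_lr = critic_lr
        # Actor: one preference per (state, action); critic: one value per state.
        self.prefs = defaultdict(lambda: [0.0] * len(self.actions))
        self.values = defaultdict(float)

    def policy(self, state):
        """Actor output: softmax probabilities over actions for this state."""
        prefs = self.prefs[state]
        top = max(prefs)
        exps = [math.exp(p - top) for p in prefs]
        norm = sum(exps)
        return [e / norm for e in exps]

    def update(self, state, action, reward, next_state, done):
        """Critic computes the TD error, which then drives both updates."""
        next_value = 0.0 if done else self.values[next_state]
        td_error = reward + self.gamma * next_value - self.values[state]
        probs = self.policy(state)
        self.values[state] += self.critic_lr * td_error
        chosen = self.actions.index(action)
        for i in range(len(self.actions)):
            # Gradient of log softmax: indicator(chosen) minus probability.
            grad = (1.0 if i == chosen else 0.0) - probs[i]
            self.prefs[state][i] += self.actor_lr * td_error * grad
        return td_error


if __name__ == "__main__":
    ac = TabularActorCritic(["a", "b"])
    ac.update("s", "a", 10.0, "s_next", done=True)
    print(ac.policy("s"))
```

A positive TD error raises the probability of the action just taken, and a negative one lowers it, so the single critic signal trains both structures.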

# cat Antivirus_Evasion_Using_RL
• Overview

The environment → the malware sample.

The environment emits the state in the form of a 2350-dimensional feature
vector:

PE header metadata.

Section metadata: section name, size and characteristics.

Import & Export Table metadata.

Counts of human readable strings.

Byte histogram.
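Two of the feature groups listed above, the byte histogram and the counts of human readable strings, can be sketched with the standard library alone. This is not the 2350-dimensional extractor from the talk: it yields only 258 values, and the minimum string length of 5 and the function names are assumptions.

```python
import re
from typing import List

# Runs of at least 5 printable ASCII characters count as one "string".
PRINTABLE_RUN = re.compile(rb"[\x20-\x7e]{5,}")


def byte_histogram(data: bytes) -> List[float]:
    """Normalized frequency of each of the 256 possible byte values."""
    counts = [0] * 256
    for value in data:
        counts[value] += 1
    total = len(data) or 1  # avoid division by zero on empty input
    return [count / total for count in counts]


def string_features(data: bytes) -> List[float]:
    """Number of printable strings and their average length."""
    strings = PRINTABLE_RUN.findall(data)
    count = len(strings)
    average = sum(len(s) for s in strings) / count if count else 0.0
    return [float(count), average]


def extract_features(data: bytes) -> List[float]:
    """Concatenate the feature groups into one fixed-length vector."""
    return byte_histogram(data) + string_features(data)


if __name__ == "__main__":
    vector = extract_features(b"\x00\x00hello world\x00")
    print(len(vector), vector[256:])
```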

• Overview

The agent → the algorithm used to change the environment.

The agent sends actions to the environment, and the environment replies with
observations and rewards (that is, a score).

There will be an anti-malware engine (the attack target).

Each step will provide:

Reward: value scored by the previous action: 10.0 if the modified sample passes the anti-malware engine undetected (pass), 0.0 if it is still detected (fail).

Observation space (object): feature vector summarizing the composition of
the malware sample.

Done (bool): determines whether the environment needs to be reset; True means the
episode was successful.
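The step interface described above can be sketched as a small class with `reset` and `step` methods, in the style of a Gym environment but without depending on any library. `ToyEvasionEnv`, the detector callable, the featurizer, and the toy actions are all hypothetical stand-ins; no real anti-malware engine is involved.

```python
from typing import Callable, Dict, List, Tuple


class ToyEvasionEnv:
    def __init__(self,
                 sample: bytes,
                 detector: Callable[[bytes], bool],
                 actions: Dict[str, Callable[[bytes], bytes]],
                 featurize: Callable[[bytes], List[float]]):
        self.original = sample
        self.current = sample
        self.detector = detector      # returns True when the sample is flagged
        self.actions = actions        # action name -> byte transformation
        self.featurize = featurize    # bytes -> observation vector

    def reset(self) -> List[float]:
        """Restore the unmodified sample and return its observation."""
        self.current = self.original
        return self.featurize(self.current)

    def step(self, action_name: str) -> Tuple[List[float], float, bool, dict]:
        """Apply one action, then score the result against the detector."""
        self.current = self.actions[action_name](self.current)
        evaded = not self.detector(self.current)
        reward = 10.0 if evaded else 0.0
        done = evaded  # True means the episode was successful
        return self.featurize(self.current), reward, done, {"evaded": evaded}


if __name__ == "__main__":
    env = ToyEvasionEnv(
        sample=b"xxSIGyy",
        detector=lambda data: b"SIG" in data,
        actions={
            "append_zero": lambda data: data + b"\x00",
            "remove_signature": lambda data: data.replace(b"SIG", b""),
        },
        featurize=lambda data: [float(len(data))],
    )
    env.reset()
    print(env.step("append_zero"))
    print(env.step("remove_signature"))
```

Since `done` only signals success in this sketch, the caller is expected to cap the number of turns per episode.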

• Overview

The actions that can be performed on a malware sample in our environment
consist of the following binary manipulations:
* append_zero
* append_random_ascii
* append_random_bytes
* remove_signature
* upx_pack
* upx_unpack
* change_section_names_from_list
* change_section_names_to_random
* modify_export
* remove_debug
* break_optional_header_checksum

Over time, the agent learns which combinations of actions lead to the highest rewards;
that is, it learns a policy.
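The three append-style manipulations from the list above are simple enough to sketch on raw bytes; the other actions need a PE-aware library and are left out. The amount of data appended, the action table, and the random baseline `apply_random_actions` are assumptions for illustration, not the implementation used in the talk.

```python
import random
import string
from typing import Callable, Dict, List, Tuple


def append_zero(data: bytes, rng: random.Random) -> bytes:
    """Append a single zero byte to the end of the file."""
    return data + b"\x00"


def append_random_ascii(data: bytes, rng: random.Random, max_len: int = 32) -> bytes:
    """Append 1..max_len random ASCII letters."""
    length = rng.randint(1, max_len)
    text = "".join(rng.choice(string.ascii_letters) for _ in range(length))
    return data + text.encode("ascii")


def append_random_bytes(data: bytes, rng: random.Random, max_len: int = 32) -> bytes:
    """Append 1..max_len random bytes."""
    length = rng.randint(1, max_len)
    return data + bytes(rng.randrange(256) for _ in range(length))


ACTION_TABLE: Dict[str, Callable[[bytes, random.Random], bytes]] = {
    "append_zero": append_zero,
    "append_random_ascii": append_random_ascii,
    "append_random_bytes": append_random_bytes,
}


def apply_random_actions(data: bytes, n_steps: int, seed: int = 0) -> Tuple[bytes, List[str]]:
    """Baseline without learning: apply n_steps uniformly random actions."""
    rng = random.Random(seed)
    history: List[str] = []
    for _ in range(n_steps):
        name = rng.choice(sorted(ACTION_TABLE))
        data = ACTION_TABLE[name](data, rng)
        history.append(name)
    return data, history


if __name__ == "__main__":
    modified, history = apply_random_actions(b"MZ", n_steps=3)
    print(history, len(modified))
```

All three actions only add trailing data, so the original bytes stay intact at the start of the buffer; a trained agent would replace the uniform random choice with its learned policy.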
