Black-Box attacks against Neural Networks - technical project presentation

PRACTICAL BLACK-BOX ATTACKS AGAINST MACHINE LEARNING
Sapienza University of Rome
MSc in Engineering in Computer Science
Neural Networks, AY 2018/19
Submitted to Prof. A. Uncini
Work by R. Falconi, S. Clinciu

INTRODUCTION
The objective of our presentation is to explain and demonstrate what a black-box
attack against deep neural network (DNN) classifiers is, how to implement it, and
to show some practical examples.

BLACK-BOX
The goal of the adversary is to force a classifier to misclassify inputs into any class
different from their correct class. The attack is called black-box because the adversary
has access to the DNN output only. The adversary has no knowledge of the architectural
choices made to design the DNN, which include the number, type and size of layers, nor
of the training data used to learn the DNN’s parameters. Such attacks are referred to as
black box, where adversaries need not know internal details of a system to

Target model: we consider attackers targeting a multiclass DNN classifier. It outputs
probability vectors, where each vector component encodes the DNN’s belief that the input
belongs to one of the predefined classes. We use the running example of a DNN classifying
images. Such DNNs can be used to classify handwritten digits into classes associated with
digits from 0 to 9, images of objects into a fixed number of categories, or images of
traffic signs into classes identifying their type (STOP, yield,

EXAMPLES OF ADVERSARIAL CAPABILITIES

THREAT MODEL
The threat model corresponds to the real-world scenario of users interacting with
classifiers hosted remotely by a third party that keeps the model internals secret.
The black-box attack is applicable to many remote systems that take decisions based on
ML, because it combines three key properties: the capabilities required are limited to
observing output class labels, the number of labels queried is limited, and the approach
applies and scales to different ML classifier types, in addition to state-of-the-art
DNNs.
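
To make the first two properties concrete, here is a minimal sketch of how the adversary sees the target. The class `LabelOnlyOracle`, its `max_queries` budget and the stand-in `toy_remote_model` are hypothetical names introduced only for illustration; they are not part of our repository or of the paper.

```python
import numpy as np

class LabelOnlyOracle:
    """Hypothetical wrapper: the adversary sees labels only, under a query budget."""

    def __init__(self, predict_proba, max_queries):
        self._predict_proba = predict_proba  # model internals stay hidden
        self.max_queries = max_queries
        self.queries = 0

    def label(self, x):
        x = np.atleast_2d(x)
        if self.queries + len(x) > self.max_queries:
            raise RuntimeError("query budget exhausted")
        self.queries += len(x)
        # Only the most likely class is returned, never the probability vector.
        return np.argmax(self._predict_proba(x), axis=1)

def toy_remote_model(x):
    """Stand-in for the remote classifier: softmax over a fixed linear map."""
    w = np.random.default_rng(0).normal(size=(x.shape[1], 3))
    logits = x @ w
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)
```

An attack built on such an interface can only count on `label(...)` and must stay within the budget, which is why the number of oracle queries matters in the following slides.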

BLACK BOX ATTACK STRATEGY
• Use the target DNN as an oracle to construct a synthetic dataset: The inputs are
synthetically generated and the outputs are labels observed from the oracle.
• Use the synthetic dataset created to build an approximation F of the model O learned by
the oracle.
• Use this substitute network F to craft adversarial samples. As long as the transferability
property holds between F and O, adversarial samples crafted for F will also be misclassified
by O.
The strategy can be summarized in two steps:
1. Substitute Model Training
2. Adversarial Sample Crafting

SUBSTITUTE MODEL TRAINING
Training a substitute model F is challenging because we must select an architecture for
our substitute without knowledge of the targeted oracle’s architecture, and we must limit
the number of queries made to the oracle to keep the problem tractable.
To overcome these challenges, a synthetic data generation technique was introduced: the
Jacobian-based Dataset Augmentation.

SUBSTITUTE ARCHITECTURE
The adversary must have some partial knowledge of the oracle’s input (images, text, …)
and expected output, so that it can use an architecture adapted to the input-output relation.
For instance, a convolutional neural network is suitable for image classification.

GENERATING A SYNTHETIC DATASET
We could make an infinite number of queries to obtain the oracle’s output $O(x)$ for any
input $x$ belonging to the input domain, and this would provide us with a copy of the
oracle. However, this is simply intractable.
To address this issue, a heuristic that efficiently explores the input domain was introduced.
The heuristic used to generate synthetic training inputs is based on identifying directions
in which the model’s output is varying, around an initial set of training data. These
directions are identified with the substitute DNN’s Jacobian matrix $J_F$, which is evaluated
at several input points $x$.
More precisely, the adversary calculates $\operatorname{sgn}(J_F(x)[O(x)])$, the sign of the
Jacobian row corresponding to the label that the oracle assigns to $x$.
To obtain a new synthetic training point, the term $\lambda \cdot \operatorname{sgn}(J_F(x)[O(x)])$
is added to the original point $x$.
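
The augmentation step can be sketched in a few lines of NumPy. This is an illustration under simplifying assumptions, not the code from our repository: the substitute is taken to be a plain softmax-linear model F(x) = softmax(Wx + b) so that its Jacobian has a closed form, and the step size `lam=0.1` and all function names are chosen here for the example.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def substitute_jacobian(W, b, x):
    """Jacobian dF/dx of F(x) = softmax(W x + b), shape (n_classes, n_features)."""
    p = softmax(W @ x + b)
    return (np.diag(p) - np.outer(p, p)) @ W

def jacobian_augment(W, b, S, oracle_labels, lam=0.1):
    """Return S together with the new points x + lam * sgn(J_F(x)[O(x)])."""
    new_points = [
        x + lam * np.sign(substitute_jacobian(W, b, x)[y])  # row of the oracle's label
        for x, y in zip(S, oracle_labels)
    ]
    return np.vstack([S, np.array(new_points)])
```

Each call doubles the size of the set, and every new point differs from its parent by at most `lam` in each input component.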

SUBSTITUTE DNN TRAINING ALGORITHM
• Initial collection: The adversary collects a very small set of inputs representative of the
input domain.
• Architecture selection: The adversary selects an architecture to be trained as the
substitute F.
• Substitute training: The adversary iteratively trains more accurate substitute DNNs $F_p$ by
repeating the following for $p \in 0 \ldots p_{\max}$:
– Labeling (step 3): the adversary labels each sample $x \in S_p$ in its current substitute
training set $S_p$ by querying the oracle;
– Training (step 4): the adversary trains the chosen architecture using the substitute
training set $S_p$;
– Augmentation: the adversary applies the Jacobian-based dataset augmentation to the current
substitute training set $S_p$ to produce a larger substitute training set $S_{p+1}$. The new
training set better represents the model’s decision boundaries. The adversary repeats steps 3
and 4 with the augmented set $S_{p+1}$.
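
The loop above can be sketched as follows. This is a toy illustration rather than our actual implementation: the substitute is assumed to be a softmax-linear model trained by plain gradient descent instead of a DNN, the oracle is any function returning labels, and the names `train_softmax`, `new_points`, `train_substitute` and the defaults `p_max=3`, `lam=0.1` are assumptions made for the example.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def train_softmax(X, y, n_classes, epochs=200, lr=0.5):
    """Step 4: fit F(x) = softmax(W x + b) by gradient descent on cross-entropy."""
    W = np.zeros((n_classes, X.shape[1]))
    b = np.zeros(n_classes)
    Y = np.eye(n_classes)[y]                      # one-hot oracle labels
    for _ in range(epochs):
        G = (softmax(X @ W.T + b) - Y) / len(X)   # gradient w.r.t. the logits
        W -= lr * (G.T @ X)
        b -= lr * G.sum(axis=0)
    return W, b

def new_points(W, b, S, labels, lam):
    """Augmentation: x + lam * sgn(J_F(x)[O(x)]) for every x in S."""
    P = softmax(S @ W.T + b)
    p_y = P[np.arange(len(S)), labels][:, None]
    J_y = p_y * (W[labels] - P @ W)               # Jacobian row of the oracle's label
    return S + lam * np.sign(J_y)

def train_substitute(oracle, S0, n_classes, p_max=3, lam=0.1):
    S = np.asarray(S0, dtype=float)
    labels = oracle(S)                            # step 3 on the initial set
    for p in range(p_max + 1):
        W, b = train_softmax(S, labels, n_classes)
        if p == p_max:
            break
        extra = new_points(W, b, S, labels, lam)
        labels = np.concatenate([labels, oracle(extra)])  # query new points only
        S = np.vstack([S, extra])
    return W, b, S, labels
```

In this sketch only the newly generated points are sent to the oracle at each iteration, so the total number of queries equals the size of the final training set.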

ADVERSARIAL SAMPLE CRAFTING
Once the adversary has trained a substitute DNN F, it uses F to craft adversarial samples, which are then misclassified by the
oracle O due to the transferability of adversarial samples. We provide an overview of two approaches. Both share the intuition of
evaluating the model’s sensitivity to input modifications in order to select a small perturbation that achieves the
misclassification goal:
• Goodfellow et al. algorithm (also known as Fast Gradient Sign Method or FGSM).
• Papernot et al. algorithm (also known as Jacobian-based Saliency Map Attack or JSMA).

GOODFELLOW ET AL. ALGORITHM (FGSM)
Given a model F with an associated cost function $c(F, x, y)$, the adversary crafts an
adversarial sample $x^* = x + \delta_x$ for a given legitimate sample $x$ by computing the
following perturbation:
$\delta_x = \epsilon \, \operatorname{sgn}(\nabla_x c(F, x, y))$
where the perturbation direction is the sign of the gradient of the model’s cost function
(hence the name Fast Gradient Sign Method), computed with respect to $x$ using sample $x$
and label $y$ as inputs, and $\epsilon$ sets the perturbation magnitude.
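
As a minimal sketch of this perturbation, assume the substitute is a softmax-linear model F(x) = softmax(Wx + b) with cross-entropy cost, for which the gradient with respect to x has a closed form. The function names and the value of `eps` are illustrative assumptions; our repository applies the method to DNNs through a deep learning framework instead.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cost(W, b, x, y):
    """Cross-entropy cost c(F, x, y) = -log F_y(x)."""
    return -np.log(softmax(W @ x + b)[y])

def cost_gradient(W, b, x, y):
    """Gradient of the cost w.r.t. x: W^T (F(x) - onehot(y))."""
    p = softmax(W @ x + b)
    p[y] -= 1.0
    return W.T @ p

def fgsm(W, b, x, y, eps):
    """Adversarial sample x* = x + eps * sgn(grad_x c(F, x, y))."""
    return x + eps * np.sign(cost_gradient(W, b, x, y))
```

Every input component moves by exactly `eps` or stays put, so the perturbation is bounded by `eps` in the max norm. For image inputs one would typically also clip the result back to the valid pixel range.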

PAPERNOT ET AL. ALGORITHM (JSMA)
The Papernot et al. algorithm is suitable for source-target misclassification attacks, where
adversaries seek to take samples from any legitimate source class to any chosen target class.
Misclassification attacks are a special case of source-target misclassification, where the
target class can be any class different from the legitimate source class. Given a model F,
the adversary crafts an adversarial sample $x^* = x + \delta_x$ for a given legitimate sample
$x$ by adding a perturbation $\delta_x$ to a subset of the input components $x_i$.
Each algorithm has its benefits and drawbacks. The Goodfellow et al. algorithm is well suited
to quickly crafting many adversarial samples, but with relatively large perturbations that
are potentially easier to detect. The Papernot et al. algorithm reduces perturbations at the
expense of a greater computing cost.
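
The saliency-map idea behind the Papernot et al. algorithm can be illustrated with a single simplified step. The sketch assumes a softmax-linear substitute, perturbs one input component per step and uses an arbitrary step size `theta`; the published algorithm iterates and differs in its details, so this is not a faithful reimplementation.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def jacobian(W, b, x):
    """Jacobian dF/dx of F(x) = softmax(W x + b), shape (n_classes, n_features)."""
    p = softmax(W @ x + b)
    return (np.diag(p) - np.outer(p, p)) @ W

def saliency_map(J, target):
    """High where increasing x_i raises the target class and lowers the others."""
    d_target = J[target]
    d_others = J.sum(axis=0) - d_target
    s = d_target * np.abs(d_others)
    s[(d_target < 0) | (d_others > 0)] = 0.0
    return s

def jsma_step(W, b, x, target, theta=1.0):
    """Increase the single most salient input component by theta."""
    s = saliency_map(jacobian(W, b, x), target)
    i = int(np.argmax(s))
    x_new = x.copy()
    x_new[i] += theta
    return x_new, i
```

Only one component of x changes per step, which is how the perturbation stays confined to a small subset of the input, at the price of recomputing the Jacobian at every step.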

MNIST AND CIFAR DATASETS
To validate the attack, we ran it against different classifiers using different attack
types. We first ran an FGSM attack against a target DNN trained on the MNIST dataset, then
another attack against a DNN trained on the CIFAR dataset. Both attacks aim to have most of
the crafted adversarial examples misclassified while keeping the perturbation small enough
not to affect human recognition. Finally, we repeated both attacks using JSMA.
The MNIST database (Modified National Institute of Standards and Technology database) is a
large database of handwritten digits that is widely used for training and testing image
processing and machine learning systems.
The CIFAR-10 dataset (Canadian Institute For Advanced Research) is a collection of images
commonly used to train machine learning and computer vision algorithms. It contains 60,000
32x32 color images in 10 classes: airplanes, cars, birds, cats, deer, dogs, frogs, horses,
ships, and trucks. There are 6,000 images of each class.

ATTACK VALIDATION
MNIST and CIFAR are two of the most widely used datasets for machine learning research.
The goal is to verify whether the adversarial samples crafted using the substitute DNN are
also misclassified by the oracle. The transferability of adversarial samples is therefore
measured as the oracle’s misclassification rate on adversarial samples crafted using the
substitute DNN.
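
As a small sketch of this metric, with a hypothetical helper name and a toy one-dimensional oracle used purely for illustration:

```python
import numpy as np

def transferability(oracle_predict, X_adv, y_true):
    """Fraction of substitute-crafted adversarial samples the oracle misclassifies."""
    return float(np.mean(oracle_predict(X_adv) != y_true))

# Toy oracle: predicts class 1 when the single feature is positive, else class 0.
toy_oracle = lambda X: (X[:, 0] > 0).astype(int)
X_adv = np.array([[1.0], [-1.0], [2.0], [-3.0]])
y_true = np.array([1, 1, 0, 0])
rate = transferability(toy_oracle, X_adv, y_true)  # 2 of 4 misclassified -> 0.5
```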

GENERALIZATION OF THE ATTACK
The substitutes and oracles considered so far were DNNs, but the attack is not bound to
them and applies to other ML systems. For example, substitutes can also be learned with
logistic regression, and the attack generalizes to additional ML models, but the same
accuracy and efficiency are not guaranteed.

DEFENCE STRATEGIES
According to Goodfellow and Papernot, there are two types of defense strategies:
reactive, where one seeks to detect adversarial samples, and proactive, where one makes
the model itself more robust. We present adversarial training as the reactive strategy
and defensive distillation as the proactive one.
ADVERSARIAL TRAINING (REACTIVE)
Adversarial training seeks to improve the generalization of a model when presented with
adversarial examples at test time by proactively generating adversarial examples as part
of the training procedure. The approach was initially impractical because generating
adversarial examples was computationally expensive. Goodfellow et al. then showed how to
generate adversarial examples inexpensively with the fast gradient sign method, which
made it computationally efficient to generate large batches of adversarial examples
during the training process.

DEFENCE STRATEGIES
DEFENSIVE DISTILLATION (PROACTIVE)
Defensive distillation smooths the model’s decision surface in adversarial directions
exploited by the adversary.
Distillation is a training procedure where one model is trained to predict the probabilities
output by another model that was trained earlier. It may seem counterintuitive to train
one model to predict the output of another model that has the same architecture.
The reason it works is that the first model is trained with “hard” labels (100% probability
that an image is a dog rather than a cat) and then provides “soft” labels (95% probability
that an image is a dog rather than a cat) used to train the second model. The second
distilled model is more robust to attacks such as
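
The difference between hard and soft labels can be sketched as follows. The temperature parameter `T` is an assumption borrowed from the distillation literature and is not mentioned on the slide; the logits and function names are illustrative only.

```python
import numpy as np

def softmax_T(logits, T=1.0):
    """Softmax at temperature T; a larger T gives softer probabilities."""
    z = np.asarray(logits, dtype=float) / T
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def soft_cross_entropy(student_probs, teacher_probs):
    """Loss used to train the second model on the first model's soft labels."""
    return float(-np.mean(np.sum(teacher_probs * np.log(student_probs + 1e-12), axis=-1)))

hard_label = np.array([[1.0, 0.0]])        # "100% dog"
teacher_logits = np.array([[3.0, 0.0]])    # first model's raw scores for (dog, cat)
soft_label = softmax_T(teacher_logits)     # roughly "95% dog"
softer_label = softmax_T(teacher_logits, T=10.0)
```

The second model is then trained to minimize `soft_cross_entropy` against `soft_label` rather than against `hard_label`.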

CONCLUSIONS
Our implementation follows the paper by Papernot et al., which introduces a novel
substitute training algorithm using synthetic data generation to craft adversarial
examples that are misclassified by black-box DNNs.
The study of adversarial examples is exciting because many of the most important
problems remain open, both in terms of theory and in terms of applications. On the
theoretical side, no one yet knows whether defending against adversarial examples is
a theoretically hopeless endeavor (like trying to find a universal machine learning
algorithm) or if an optimal strategy would give the defender the upper hand (like
in cryptography and differential privacy). On the applied side, no one has yet
designed a truly powerful defense algorithm that can resist a wide variety of
adversarial example attack algorithms.
Defending against finite perturbations is a more promising avenue for future work.

HOW TO RUN THE CODE
Running the code takes four steps:
a. Clone the GitHub repository with ‘git clone
https://github.com/RobertoFalconi/BlackBoxAttackDNN’
b. Enter the repository with ‘cd BlackBoxAttackDNN’
c. Install each required library with ‘pip3 install <framework name>’
d. Run the FGSM strategy with ‘python FastGradientSignMethods’ or the JSMA strategy with
‘python JacobianSaliencyMapApproach’.
Tested on Python 3.7.3 (64-bit) with NVIDIA 425.31 drivers on a GeForce RTX 2080.

RUNNING CODE
github.com/RobertoFalconi/BlackBoxAttackDNN
github.com/Clincius

REFERENCES
1. Ian Goodfellow, Nicolas Papernot. Is attacking machine learning easier than defending it?
http://www.cleverhans.io/security/privacy/ml/2017/02/15/why-attacking-machine-learning-is-easier-than-defending-it.html
2. Alexey Kurakin, Ian J. Goodfellow, Samy Bengio. Adversarial Examples in the Physical World.
[Online] 2017. https://arxiv.org/pdf/1607.02533.pdf
3. Nicolas Papernot, Patrick McDaniel, Ian Goodfellow, Somesh Jha, Z. Berkay Celik,
Ananthram Swami. Practical Black-Box Attacks against Machine Learning. [Online] 2017.
https://arxiv.org/pdf/1602.02697.pdf
4. Ian J. Goodfellow, Jonathon Shlens, Christian Szegedy. Explaining and Harnessing Adversarial
Examples. [Online] 2015. https://arxiv.org/pdf/1412.6572.pdf
5. Anish Athalye, Nicholas Carlini, David Wagner. Obfuscated Gradients Give a False Sense of
Security: Circumventing Defenses to Adversarial Examples. [Online] 2018.
https://arxiv.org/pdf/1802.00420v4.pdf
6. Nicolas Papernot. Gradient Masking in Machine Learning.
https://seclab.stanford.edu/AdvML2017/slides/17-09-aro-aml.pdf

THANK YOU!
https://www.linkedin.com/in/roberto-falconi
https://www.linkedin.com/in/stefan-clinciu-7421b2a6
