Convolutional Neural Networks
And Facial Recognition
By Taylee Gray
May 16th, 2019
Towson University
MATH 490
Table of Contents
Introduction
Example Applications
Inputs, Labels, Outputs
A Description of the Model Function
Description of Stochastic Gradient Descent
Description of Backpropagation
How Pooling Affects Complexity
Demonstration
References
Introduction
The Olivetti dataset from AT&T Laboratories Cambridge is the dataset that
will help us demonstrate the importance of Convolutional Neural Networks (ConvNets or CNNs) in
facial recognition. This dataset contains ten different portraits of each person in a distinct
set of forty individuals. With this dataset we will show how effective CNNs are in
image recognition and classification.
For starters, an Artificial Neural Network (ANN) is a computational model that is
inspired by the way biological neural networks in the human brain process information
(Ujjwalkarn). When referring to a neural network, the basic unit of computation is the neuron.
These neurons are usually called nodes. A single neuron receives inputs from other nodes
and produces an output. Every input has an associated weight, used to compute the
weighted-sum output; each weight is assigned on the basis of that input's importance
relative to the other inputs.
[Figure: a single neuron. Inputs x1 and x2 (with weights w1 and w2) and a constant input 1 (with weight b) feed the output Y = f(w1x1 + w2x2 + b).]
The neural network above computes the numerical input from 𝑥1 and 𝑥2 along with the
corresponding weights as well as the input of 1 with weight 𝑏 (known as our bias) to get an
output. Of the many activation functions, the sigmoid is the most common in a NN. The sigmoid
function takes a real-valued input and fits that input into a range between zero and one. As for a
CNN, typically the ReLU function is more prominent.
Convolutional Neural Networks are a category of neural networks. These networks have
greater depth, hence the concept of deep learning. Deep neural networks are powerful algorithms
but often much harder to train than shallow ones. In 1994, one of the first ConvNets was pioneered by
Yann LeCun (Ujjwalkarn). This propelled the field of deep learning, and the network was later named LeNet5.
LeNet5's architecture consisted of three groups of layers:
● A 32×32 input layer
● Convolution and subsampling layers consisting of:
  ● 6 different 28×28 feature maps
  ● 6 different 14×14 feature maps
  ● 16 different 10×10 feature maps
  ● 16 different 5×5 feature maps
● A fully connected 2D layer: 120 outputs to 84 outputs to the final output of 10
Even without a visual, it is clear that the original input is the largest layer, which is then
decomposed by pooling and nonlinearity to extract specific features, eventually giving us the
optimal output.
Example Applications
Time and technology go hand in hand, and as they progress we are presented with both
positives and negatives. CNNs can be used for the greater good, for instance by automatically
detecting cancer in endoscopic images. Studies by Yuma Endo and other engineers at AI Medical
Service, Inc. in Japan presented a CNN that was trained using 13,584 endoscopic images of
gastric cancer (Hirasawa). To evaluate the accuracy of the CNN, they also applied it to an
independent test set of 2,296 stomach images collected from 69 consecutive patients and
containing 77 gastric cancer lesions. As a result, the CNN correctly diagnosed 92%
of its patients, so we can conclude that this could be well applicable to the medical field,
reducing the burden on endoscopists. In general, the medical field is always advancing, and what
is popular at the moment is how CNNs are used to dispense medication based on a facial scan.
Facial recognition is used worldwide in our everyday lives. It has been successful in
catching criminals by the use of surveillance cameras. Any time a person goes missing there is a
limited period of time to find them before the odds reduce significantly. As previously stated, the
use of surveillance cameras in these situations is highly effective when combined with facial
detection software built on CNNs. Apple made a huge debut with the iPhone X, which uses facial
recognition to unlock the device. Also, pinpointing terrorists by the use of facial recognition
helps minimize the possibility of attacks; the list goes on and on.
However, on the opposite end of the spectrum, is facial recognition ethical? Despite the
many positives of facial recognition, CNNs invoke plenty of criticism from legal and
ethical standpoints. Facebook currently faces a lawsuit over its own facial recognition technology,
called DeepFace, which identified people in photos without their consent. Amazon's
smart home company Ring also came under fire for the same violation of civil rights.
As of May 15th, 2019, San Francisco banned facial recognition technology, making it the
first city in the United States to have such a restriction. The disadvantage of these networks is
simply bias: people being falsely accused because they are falsely recognized. This bias
disproportionately affects people of color, who are misidentified more often. According to CBS
News, twenty-eight members of Congress were falsely matched with mugshots of criminals. The
ban will not apply to federal use due to security reasons. All in all, it is hard to draw a line
between the good and the bad when it comes to the ethics of facial recognition with CNNs.
Inputs, Labels, Outputs
Our input to the model is a total of 400 images: our dataset contains 40 people,
each with 10 pictures of them. The photos are of size 64×64 and grayscale, so we don't have
to use a three-dimensional input for our model. The images do not show large groups of people
or whole shots of a person's body; they contain just the subject's face with no background
showing. The labels for the model are the names of each person in our subject group, or more
precisely a number associated with that person.
The output layer of the model is represented by 40 nodes. Each node represents a person
and the likelihood of the input image being that person. We do this by using the softmax
activation function. The softmax function takes an input vector and normalizes it into a
probability distribution consisting of 40 probabilities. The softmax function and its derivative are given by:

$$\sigma(\mathbf{z})_j = \frac{e^{z_j}}{\sum_{k=1}^{40} e^{z_k}}, \qquad \frac{\partial \sigma_i}{\partial z_j} = \sigma_i \left( \delta_{ij} - \sigma_j \right)$$

(where $\sigma$ is the softmax output and $\delta_{ij}$ is the Kronecker delta function, which appears in the derivative used for backpropagation).
Using the softmax we can create a probability distribution across all output nodes, giving the
likelihood of the image being associated with each node. The node with the largest probability
ends up being the node that the Neural Network "selects". Therefore, it selects one out of 40
names to be the label.
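As a minimal sketch of how this selection works, here is the softmax and the final "selection" in NumPy; the 40 random scores below stand in for a real output layer:

```python
import numpy as np

def softmax(z):
    """Normalize a vector of raw scores into a probability distribution.
    Subtracting the max first is a standard trick for numerical stability."""
    shifted = z - np.max(z)
    exp_z = np.exp(shifted)
    return exp_z / np.sum(exp_z)

scores = np.random.randn(40)       # 40 raw scores, one per person in the dataset
probs = softmax(scores)
predicted_label = np.argmax(probs) # the name the network "selects"
print(probs.sum())                 # 1.0 -- a valid probability distribution
```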
A Description of the Model Function
There are four primary portions of a Convolutional Neural Network that differentiate it
from a traditional Fully Connected Neural Network: CNNs include Convolutional Layers and
Pooling Layers, typically (though not exclusively) use the ReLU activation function, use
something called Dropout to protect against overfitting, and use a Flattening Function to connect
the preprocessing to a traditional NN. The convolutional and pooling layers are introduced at the
beginning of the network so that the image being used as input does not have to be fed as-is into
the Fully Connected Layer. CNNs are useful because it is unwise to use a raw image as
the first layer of a traditional NN due to The Curse of Dimensionality, which states that as the
dimensions of your input increase, the number of data points should grow (exponentially) so
that you can "fill" the feature space adequately (Shetty, 2019).
In the case of facial recognition, an image tends to be very large, and therefore the feature space
tends to be extremely large. A 64×64 pixel image creates a 4,096-dimensional feature space, and
three times that if you have a colored image (one dimension per pixel for each color channel).
If it is not feasible to get more data points, the only option is to implement a kind of
preprocessing to decrease complexity. This is the basis of CNNs.
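To put rough numbers on this: even a modest fully connected first layer of 64 neurons on a flattened 64×64 grayscale image already needs $4096 \cdot 64 = 262{,}144$ weights, whereas a convolutional layer of 64 filters of size 3×3 needs only $64 \cdot 9 = 576$ shared weights (plus biases) while still covering the whole image.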
The first step of a CNN is its namesake, the Convolutional Layer. The purpose of
convolution is to extract features from the image while maintaining a kind of spatial relationship
between the pixels in the image. To convolve an image, a "filter" or "kernel" is used; call it $K$.
The filter is a matrix of some predetermined size $f \times f$ that is smaller than the input
image, which has size $n \times n$. The filter also has a "stride" $s$, the number of pixels
moved per step either by row or column. The filter's entries are randomly initialized numbers,
the parameters of the matrix that will later be learned via backpropagation. Lastly, there is the
number of filters used in a layer, $n_f$.
The first step of convolution is to place the top left of your filter at the top left of your
image. Call the portion of the image $I$ that overlaps $K$ by the name $P$. Then compute the sum
of the element-wise product of the two overlapping matrices. In other words:

$$(I * K)_{1,1} = \sum_{a=1}^{f} \sum_{b=1}^{f} P_{a,b} \, K_{a,b}.$$
This is the $(1,1)$ element of the convolved feature. Next, take a single stride to the right and repeat
this process. Striding right increments the column index of the convolved feature. When your filter hits
the end of your input image, reset the filter to the beginning of the row and take a stride down,
incrementing the row index of the convolved feature. Continue until the bottom of the image is
reached. If you have a three-dimensional feature set (a colored image), repeat this
process for each slice of the third dimension.
This process results in a convolved feature of size:

$$\left( \frac{n-f}{s} + 1 \right) \times \left( \frac{n-f}{s} + 1 \right) \times n_f.$$
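To make the sliding-window arithmetic concrete, here is a minimal NumPy sketch of a single-filter, single-channel convolution; the 64×64 input and 3×3 filter are illustrative choices, not values fixed by the text up to this point:

```python
import numpy as np

def convolve2d(image, kernel, stride=1):
    """Slide an f x f kernel over an n x n image, taking the sum of the
    element-wise product at each position (no padding)."""
    n, f = image.shape[0], kernel.shape[0]
    out_size = (n - f) // stride + 1            # the (n - f)/s + 1 formula above
    out = np.zeros((out_size, out_size))
    for i in range(out_size):                   # striding down: row index
        for j in range(out_size):               # striding right: column index
            patch = image[i*stride:i*stride+f, j*stride:j*stride+f]
            out[i, j] = np.sum(patch * kernel)
    return out

image = np.random.rand(64, 64)    # one grayscale Olivetti-sized image
kernel = np.random.randn(3, 3)    # randomly initialized filter parameters
feature = convolve2d(image, kernel)
print(feature.shape)              # (62, 62) = ((64 - 3)/1 + 1, (64 - 3)/1 + 1)
```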
After a convolutional layer, it is convention to use the Rectified Linear Unit (or ReLU) as
our activation function. ReLU is defined as:

$$\mathrm{ReLU}(x) = \max(0, x).$$
Previously, we had been using the Sigmoid function as our activation function. The reason
ReLU is far more commonly used is that Sigmoid suffers from something called the vanishing
gradient.
Because $\sigma'(x) = \sigma(x)\,(1 - \sigma(x))$,
it can be shown that $0 < \sigma'(x) \le \frac{1}{4}$.
Repeatedly multiplying small numbers over multiple layers will result in a smaller and smaller
gradient that eventually yields no change to your model; after only ten sigmoid layers, for
example, the chained gradient is already bounded by $(1/4)^{10} \approx 10^{-6}$. On the other
hand, using ReLU, the gradient is zero if $x < 0$ and one if $x > 0$. (Because ReLU is not
differentiable at $x = 0$, the derivative there is taken to be zero in practice.) Because of this,
successive gradients of the ReLU
neither explode nor vanish, so it makes a good activation function for a model with many layers
like a CNN.
The Pooling Layer acts similarly to the Convolutional Layer, in that it reduces the
dimensionality of the input while trying to maintain the important information in the feature
map. By reducing the dimensionality of the input, the network is more easily computable, better
protected against overfitting, and less likely to be affected by tiny transformations and
distortions of the input.
Like the convolutional filter, pooling has a filter as well, with a size $f \times f$ and a stride
$s$, but unlike a convolutional filter it does not have parameters to be changed by
backpropagation. Instead, it uses a fixed function to get its output number. The most popular and
effective option is max-pooling, where each element of the output matrix is the largest number
inside the pooling filter's window. Sum and average functions also exist and can be used.
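A hedged NumPy sketch of max-pooling, structured like the convolution sketch above; note there are no learned parameters, only a fixed function applied to each window:

```python
import numpy as np

def max_pool2d(feature, size=2, stride=2):
    """Replace each size x size window with its largest value."""
    n = feature.shape[0]
    out_size = (n - size) // stride + 1
    out = np.zeros((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            window = feature[i*stride:i*stride+size, j*stride:j*stride+size]
            out[i, j] = np.max(window)  # swap in np.sum or np.mean for sum/average pooling
    return out

feature = np.random.rand(62, 62)  # e.g. the output of a convolutional layer
pooled = max_pool2d(feature)
print(pooled.shape)               # (31, 31): each spatial dimension is halved
```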
While pooling can help fight overfitting, it is not always successful on its own. That is where we
can implement dropout. Dropout is a "regularization method that approximates training a large
number of neural networks with different architectures in parallel" (Brownlee, 2019). To do this,
while training the network a random selection of neurons is ignored (dropped out), meaning all
of their incoming and outgoing connections are ignored. This introduces noise and error into the
training process, which helps keep any one neuron from acquiring weights that are too large; very
large weights are a sign that your data has been overfit. By thinning out the network using
dropout, more neurons may be required to maintain the same number of activated neurons in the
network. This process keeps our data from being overfitted.
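A minimal sketch of dropout as a random NumPy mask; the rescaling by 1/(1 − rate), known as "inverted dropout", is one common convention for keeping expected activations constant, and the layer size here is illustrative:

```python
import numpy as np

def dropout(activations, rate=0.2, training=True):
    """Randomly zero out a fraction `rate` of neurons during training.
    Scaling the survivors keeps the expected activation the same,
    so nothing needs to change at test time."""
    if not training:
        return activations
    mask = (np.random.rand(*activations.shape) >= rate)
    return activations * mask / (1.0 - rate)

layer_output = np.random.rand(64)
thinned = dropout(layer_output, rate=0.2)  # roughly 20% of entries are now zero
```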
The last step of a CNN is the flattening. Flattening simply takes the output of
our convolution and pooling layers and transforms it to be taken in as a vector by our fully
connected layer. This can be done by reading the matrix left-to-right, top-to-bottom and
filling out a one-dimensional vector element-by-element. The result is a vector of
length $h \cdot w \cdot n_f$ (height times width times the number of feature maps) that can be
used as the input of our Fully Connected Layer. Once we have this input vector we can operate
our NN as a typical multi-layer perceptron. We assume you have a background in these kinds of
NNs, so we won't go into detail on their structure here.
Description of Stochastic Gradient Descent
In the CNN process, the gradient descent optimization algorithm aims to minimize some
cost/loss function based on that function's gradient. Successive iterations are employed to
progressively approach either a local or global minimum of the cost function.
Recall the gradient of a function $f(\mathbf{w})$ is given by:

$$\nabla f(\mathbf{w}) = \left( \frac{\partial f}{\partial w_1}, \frac{\partial f}{\partial w_2}, \ldots, \frac{\partial f}{\partial w_n} \right)$$

The gradient can be intuitively thought of as the path of descent a ball would take while
rolling down a hill. However, using neural networks we don't have access to this function
otherwise we would be done! There would be no reason to train a model. Instead we have
access to a loss function which can help us approximate this.
Our loss function is described as the average of squared differences between what your model gave
you and what your model should give you:

$$L(\mathbf{w}) = \frac{1}{N} \sum_{i=1}^{N} \left( \hat{y}_i - y_i \right)^2$$

(This is the Mean Squared Error, or MSE.)
Using our loss function we can approximate GD using a subset of our training data. This
is the difference between Stochastic Gradient Descent and Gradient Descent.
The iterative process for SGD is as follows:
● Start at some randomly selected vector $\mathbf{w}_0$.
● For each step $t$, have some process to generate $S_t$, a subset of our training set.
● Then compute $\nabla L_{S_t}(\mathbf{w}_t)$ so that we can use the formula to update our vector:
$$\mathbf{w}_{t+1} = \mathbf{w}_t - \eta \, \nabla L_{S_t}(\mathbf{w}_t)$$
● Here $\eta$ is our predetermined learning rate.
Repeating this process, we approach a local minimum of our loss function. Our hope is
that this local minimum is actually the global minimum, so that our model is fully
optimized, but this is not guaranteed. Also, because SGD applies only a subset of the training
data at each step, each iteration is cheaper than in GD, so it gets near the minimum much faster
but probably won't converge to it exactly. Instead, it oscillates around the minimum, but the
approximation is good enough (Ng).
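To make the update rule concrete, here is a small NumPy sketch of SGD on a linear model under the MSE loss; the data, batch size, and learning rate are illustrative placeholders:

```python
import numpy as np

def sgd_step(w, X_batch, y_batch, lr=0.01):
    """One SGD update: the gradient of the MSE is computed on a random
    mini-batch S_t, not on the full dataset."""
    preds = X_batch @ w
    grad = (2.0 / len(y_batch)) * X_batch.T @ (preds - y_batch)  # d(MSE)/dw
    return w - lr * grad                                          # w_{t+1} = w_t - eta * grad

rng = np.random.default_rng(0)
X, y = rng.normal(size=(400, 5)), rng.normal(size=400)
w = rng.normal(size=5)                                 # randomly selected starting vector w_0
for t in range(1000):
    idx = rng.choice(len(y), size=10, replace=False)   # the mini-batch S_t
    w = sgd_step(w, X[idx], y[idx])
```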
Description of Backpropagation
Backpropagation is how a neural network learns: the network looks at how much error or cost
was calculated and tries to minimize that number. For CNNs we look at the error between
each of our layers and decide whether or not something matches in that layer and, if so, how well
it matches. From there the neural network decides what or who it is looking at. For example, the
cost of a single neuron in a network can be given as:

$$\text{Cost} = C(R(Z(X, W)))$$

where $C$ is the cost function:

$$C = \frac{1}{N} \sum_{i=1}^{N} \left( \hat{y}_i - y_i \right)^2,$$

$R$ is the ReLU, $Z$ is our weighted input ($Z = XW$), and $W$ is our weight. Taking the
partial derivative of our error function with respect to the weight we are looking at will show us
how much that weight contributes to the error, and to do so we need to expand our formula using the
chain rule:

$$\frac{\partial C}{\partial W} = \frac{\partial C}{\partial R} \cdot \frac{\partial R}{\partial Z} \cdot \frac{\partial Z}{\partial W}$$
These partial derivatives are used to check each parameter and how that parameter
contributes to the total change in error. Furthermore, as we go through multiple layers our cost
function has more and more inputs, and the expansion can get rather lengthy and
cumbersome. However, as we work through more weights we reuse the previously calculated
values when applying the chain rule, so we have the program remember those values rather than
recalculate them.
With this information gathered we can now find the actual error of each layer and see how
it impacts our final layer, the result. That is, we take the derivative of our cost function with
respect to the output layer ($Z_o$) and hidden layers ($Z_h$):

$$E_o = (\hat{y} - y) \cdot R'(Z_o), \qquad E_h = E_o \cdot W_o \cdot R'(Z_h)$$
We can see the hidden layer error is equivalent to the output layer error times our output
weights times the ReLU derivative of our hidden layer. This is the general process of
backpropagation: we keep shuffling our weighted error back to the previous layer and continue
to refine our output until we arrive at an error small enough for the machine to make an
informed decision on what the image is and put a name to it. How the machine decides to adjust
the weights of each input is derived from the SGD above; depending on whether a weight
contributes more or less error, it is adjusted accordingly to give us the lowest error possible.
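A compact NumPy sketch of these error formulas for a single hidden layer; the layer sizes are arbitrary, and reusing delta_o when forming delta_h is exactly the "remember previous values" point made above:

```python
import numpy as np

def relu(z):       return np.maximum(0.0, z)
def relu_prime(z): return (z > 0).astype(float)

# Forward pass through one hidden layer and one output layer
x  = np.random.rand(16)                       # flattened input features
Wh = np.random.randn(8, 16) * 0.1             # hidden weights
Wo = np.random.randn(1, 8) * 0.1              # output weights
Zh = Wh @ x;  Ah = relu(Zh)                   # hidden weighted input / activation
Zo = Wo @ Ah; y_hat = relu(Zo)                # network output
y  = np.array([1.0])                          # target

# Backward pass: the layer errors from the formulas above
delta_o = (y_hat - y) * relu_prime(Zo)        # E_o, the output layer error
delta_h = (Wo.T @ delta_o) * relu_prime(Zh)   # E_h = E_o * W_o * R'(Z_h)
dWo = np.outer(delta_o, Ah)                   # dC/dWo -- reuses delta_o, no recomputation
dWh = np.outer(delta_h, x)                    # dC/dWh
```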
How Pooling Affects Complexity
Pooling is an important layer in Convolutional Neural Networks because it decreases the
complexity of your model. By down-sampling our input layer by layer we can protect against
overfitting to minor variations in our data that would otherwise look like a completely new
image to a naive network. We also decrease the dimensionality of the network, which improves
manageability and runtime.
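As a rough worked example of the savings: a single 2×2 max-pool with stride 2 halves each spatial dimension, so a 64×64 feature map ($4096$ values) shrinks to 32×32 ($1024$ values), a 75% reduction, and a second such layer leaves only $16 \times 16 = 256$ values, one sixteenth of the original.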
There are several different kinds of functions that can be used in your Pooling Layer, and they
all perform differently: there are Sum, Average, and Max Pooling functions, and in a network it
isn't uncommon to use a combination of these. Sum and Average have similar results, just scaled
differently, so the comparison to be had is between Average and Max. Max pooling takes the
largest value over the filter while Average takes the average of all values under it. The
intuition here is that Max takes the "most important" feature, while Average takes into
consideration all of the values within the filter. According to Karpathy in Convolutional
Neural Networks for Visual Recognition, the most effective pooling function is Max Pooling due
to its ability to extract the valuable information from the image while not being affected as
much by small variations.
However, if a Pooling Layer has the same structure as a Convolutional Layer minus the
parameters and with a predetermined function added, would it be feasible to replace it with another
Convolutional Layer in an attempt to get the best of both worlds? The Convolutional Layer can
still down-sample the input features, but it can use SGD to learn filters that would hopefully
extract more information from the image. In Springenberg et al.'s Striving for Simplicity: The All
Convolutional Net, they explored this possibility in hopes of finding a simpler CNN
model. They used the CIFAR-10 dataset (Krizhevsky & Hinton, 2009) to study different
models.
The first model used was the control model, your typical CNN with Pooling Layers.
The second model was an All-CNN with an incremented stride on the Convolutional Layers
that precede the removed Pooling Layers. It is important to increment the stride because the next
layer should be accepting an input covering the same spatial region as it did in the
control model. The final model replaced the Pooling Layer with a Convolutional Layer. After
comparing the models, the results showed an increase in accuracy with Pooling Layers replaced,
as well as a lower loss.
Our expectation was that a network with no Pooling Layers would perform poorly due to
overfitting of the data, which is the main advantage of using Pooling Layers in the first place. In
order to test Springenberg's conclusions we ran tests of our own on a relatively simply structured
network with only modest anti-overfitting countermeasures to see if these All
Convolutional Networks are prone to overfitting.
Demonstration
To demonstrate the effect of no pooling on a Convolutional Neural Network we started
by first building a typical CNN. We built the CNN using Keras, a high-level Python
library built on Google's TensorFlow for machine learning. Our CNN started with ten pictures
each of forty different people from AT&T's Olivetti dataset. It is important to note that the
images we used had been preprocessed to contain only the face of the person being photographed.
We had tried using the Labeled Faces in the Wild dataset and found that the images were not
close enough to the subjects' faces and included too much of the background to create a
successful model. There are ways to process those faces into usable inputs, but we opted to
change datasets for simplicity.
The structure of our CNN was as follows (a Keras sketch of it appears after the list):
● Convolutional Layer of size 64, with filter size 3 x 3, stride of 1, with the ReLU
activation function
● Max Pooling Layer with filter size 2 x 2, stride of 2
● Another Convolutional Layer
● Another Max Pooling Layer
● A 10% Dropout Layer
● The Flattening Function
● A length 64 vector for the first layer of our Fully Connected Network
● A 20% Dropout Layer
● Our output layer with 40 nodes using the softmax activation function.
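Here is a sketch of this structure in Keras. The layer names and arguments follow the keras.layers API, but details the text does not pin down, such as the size of the second convolutional layer and the activation on the dense layer, are assumptions:

```python
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Dropout, Flatten, Dense

model = Sequential([
    # Convolutional layer: 64 filters of size 3x3, stride 1, ReLU activation
    Conv2D(64, (3, 3), strides=1, activation='relu', input_shape=(64, 64, 1)),
    # Max pooling: 2x2 filter, stride 2
    MaxPooling2D(pool_size=(2, 2), strides=2),
    # "Another Convolutional Layer" -- size assumed to match the first
    Conv2D(64, (3, 3), strides=1, activation='relu'),
    # "Another Max Pooling Layer"
    MaxPooling2D(pool_size=(2, 2), strides=2),
    Dropout(0.10),                    # 10% dropout
    Flatten(),                        # the flattening function
    Dense(64, activation='relu'),     # length-64 first fully connected layer
    Dropout(0.20),                    # 20% dropout
    Dense(40, activation='softmax'),  # 40 output nodes, one per person
])
```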
After we make the structure of the network, the next steps are to define the optimizer and
its parameters. As we mentioned before, our optimizer was Stochastic Gradient Descent with a
fixed learning rate, as well as a few other tweaking parameters (momentum=1, decay=0.05). After
defining the input, structure, and optimizer, the last thing to do is run the network. We used a
batch size of 10 and ran through around 100 epochs each time.
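A hedged sketch of the training call, continuing from the model above: the learning rate value is not reproduced in the text, so the lr here is a placeholder, and x_train/y_train stand in for the preprocessed Olivetti images and their one-hot labels:

```python
from keras.optimizers import SGD

# momentum and decay are as stated in the text; lr=0.01 is only a placeholder
opt = SGD(lr=0.01, momentum=1.0, decay=0.05)
model.compile(optimizer=opt,
              loss='categorical_crossentropy',  # assumed: the usual pairing with softmax
              metrics=['accuracy'])

# x_train: (n, 64, 64, 1) grayscale faces; y_train: one-hot labels over 40 people
model.fit(x_train, y_train, batch_size=10, epochs=100,
          validation_data=(x_test, y_test))
```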
Here are the graphs of accuracy and loss for our baseline CNN using SGD and Max
Pooling Layers. After 100 epochs we earned a validation accuracy of almost 80% and a validation
loss of near 1. By looking at the graphs we can see that the scores on training and testing data
are fairly close, which is a good indication that we have not overfit our data.
The next step was to remove our pooling layers and see how the model fared. In order to
remove the pooling layers there are a few changes you have to make to maintain the model's
integrity. According to Springenberg, the way to convert your traditional CNN to an all-
convolutional one is to first change the Pooling Layer into a Convolutional Layer using the same
size and stride the Pooling Layer had. Next, and importantly, you have to increase the stride of
the Convolutional Layer preceding the original Pooling Layer by one so the network can
maintain the same feature size. Here were the results after 200 epochs.
Our training data without pooling did perform better than with pooling, with an accuracy
of almost 90% and a loss below 0.5. However, our testing data did not do well at all,
with testing loss rising to over 2.0 at some points. This is a clear case of overfitting our data,
which makes sense because part of the reason pooling is used is to prevent overfitting. While our
loss did fare worse, there were some advantages to using a CNN with no pooling. While the
model did learn more slowly, it ran drastically faster: the CNN with pooling took
around 4 seconds per epoch while the CNN without pooling took under half that time. For this
reason we believe that if you used more aggressive anti-overfitting techniques, like a higher
dropout percentage, then you might see some real benefit to using an All Convolutional
Network.
References
Karpathy, A. "Convolutional Neural Networks for Visual Recognition." CS231n
Convolutional Neural Networks for Visual Recognition, cs231n.github.io/convolutional-
networks/.
"Backpropagation." ML Cheatsheet Documentation,
ml-cheatsheet.readthedocs.io/en/latest/backpropagation.html.
Brownlee, Jason. "A Gentle Introduction to Dropout for Regularizing Deep Neural Networks."
21 Apr. 2019, machinelearningmastery.com/dropout-for-regularizing-deep-neural-
networks/. Accessed 13 May 2019.
CBS/AP. "San Francisco Bans Facial Recognition Technology." CBS News, CBS Interactive,
15 May 2019, www.cbsnews.com/news/san-francisco-becomes-first-us-city-to-ban-facial-
recognition-technology-today-2019-05-14/.
Hirasawa, Toshiaki, et al. “Application of Artificial Intelligence Using a Convolutional
Neural Network for Detecting Gastric Cancer in Endoscopic Images.” SpringerLink,
Springer Japan, 15 Jan. 2018, link.springer.com/article/10.1007/s10120-018-0793-2.
Krizhevsky, A., and G. Hinton. Learning Multiple Layers of Features from Tiny Images. 2009.
Ng, Andrew. Supervised Learning. Stanford University, cs229.stanford.edu/notes/cs229-
notes1.pdf.
Shetty, Badreesh. "Curse of Dimensionality." 15 Jan. 2019,
towardsdatascience.com/curse-of-dimensionality-2092410f3d27. Accessed 13 May 2019.
Springenberg, Jost Tobias, et al. Striving for Simplicity: The All Convolutional Net. Department
of Computer Science, University of Freiburg, 2015, arxiv.org/pdf/1412.6806.pdf.
Suárez-Paniagua, Víctor. “Evaluation of Pooling Operations in Convolutional Architectures
for Drug-Drug Interaction Extraction.” BMC Bioinformatics, BioMed Central, 13 June
2018, www.ncbi.nlm.nih.gov/pmc/articles/PMC5998766/.
Ujjwalkarn. “A Quick Introduction to Neural Networks.” The Data Science Blog, 10 Aug.
2016, ujjwalkarn.me/2016/08/09/quick-intro-neural-networks/.
Weitere ähnliche Inhalte

Was ist angesagt?

Passive Image Forensic Method to Detect Resampling Forgery in Digital Images
Passive Image Forensic Method to Detect Resampling Forgery in Digital ImagesPassive Image Forensic Method to Detect Resampling Forgery in Digital Images
Passive Image Forensic Method to Detect Resampling Forgery in Digital Imagesiosrjce
 
Human Face Detection and Tracking for Age Rank, Weight and Gender Estimation ...
Human Face Detection and Tracking for Age Rank, Weight and Gender Estimation ...Human Face Detection and Tracking for Age Rank, Weight and Gender Estimation ...
Human Face Detection and Tracking for Age Rank, Weight and Gender Estimation ...IRJET Journal
 
Gender Classification using SVM With Flask
Gender Classification using SVM With FlaskGender Classification using SVM With Flask
Gender Classification using SVM With FlaskAI Publications
 
let's dive to deep learning
let's dive to deep learninglet's dive to deep learning
let's dive to deep learningMohamed Essam
 
Speech Processing with deep learning
Speech Processing  with deep learningSpeech Processing  with deep learning
Speech Processing with deep learningMohamed Essam
 
Efficient mobilenet architecture_as_image_recognit
Efficient mobilenet architecture_as_image_recognitEfficient mobilenet architecture_as_image_recognit
Efficient mobilenet architecture_as_image_recognitEL Mehdi RAOUHI
 
Convolutional Neural Network
Convolutional Neural NetworkConvolutional Neural Network
Convolutional Neural NetworkVignesh Suresh
 
IMAGE COMPOSITE DETECTION USING CUSTOMIZED
IMAGE COMPOSITE DETECTION USING CUSTOMIZEDIMAGE COMPOSITE DETECTION USING CUSTOMIZED
IMAGE COMPOSITE DETECTION USING CUSTOMIZEDijcga
 
Gesture Recognition Review: A Survey of Various Gesture Recognition Algorithms
Gesture Recognition Review: A Survey of Various Gesture Recognition AlgorithmsGesture Recognition Review: A Survey of Various Gesture Recognition Algorithms
Gesture Recognition Review: A Survey of Various Gesture Recognition AlgorithmsIJRES Journal
 
Facial Image Analysis for age and gender and
Facial Image Analysis for age and gender andFacial Image Analysis for age and gender and
Facial Image Analysis for age and gender andYuheng Wang
 
Movie Sentiment Analysis using Deep Learning RNN
Movie Sentiment Analysis using Deep Learning RNNMovie Sentiment Analysis using Deep Learning RNN
Movie Sentiment Analysis using Deep Learning RNNijtsrd
 
IRJET- Machine Learning based Object Identification System using Python
IRJET- Machine Learning based Object Identification System using PythonIRJET- Machine Learning based Object Identification System using Python
IRJET- Machine Learning based Object Identification System using PythonIRJET Journal
 
Scrdet++ analysis
Scrdet++ analysisScrdet++ analysis
Scrdet++ analysisNEHA Kapoor
 
Video surveillance Moving object detection& tracking Chapter 1
Video surveillance Moving object detection& tracking Chapter 1 Video surveillance Moving object detection& tracking Chapter 1
Video surveillance Moving object detection& tracking Chapter 1 ahmed mokhtar
 
HUMAN MOTION DETECTION AND TRACKING FOR VIDEO SURVEILLANCE
HUMAN MOTION  DETECTION AND TRACKING FOR VIDEO SURVEILLANCEHUMAN MOTION  DETECTION AND TRACKING FOR VIDEO SURVEILLANCE
HUMAN MOTION DETECTION AND TRACKING FOR VIDEO SURVEILLANCENEHA THADEUS
 

Was ist angesagt? (20)

Passive Image Forensic Method to Detect Resampling Forgery in Digital Images
Passive Image Forensic Method to Detect Resampling Forgery in Digital ImagesPassive Image Forensic Method to Detect Resampling Forgery in Digital Images
Passive Image Forensic Method to Detect Resampling Forgery in Digital Images
 
Human Face Detection and Tracking for Age Rank, Weight and Gender Estimation ...
Human Face Detection and Tracking for Age Rank, Weight and Gender Estimation ...Human Face Detection and Tracking for Age Rank, Weight and Gender Estimation ...
Human Face Detection and Tracking for Age Rank, Weight and Gender Estimation ...
 
Gender Classification using SVM With Flask
Gender Classification using SVM With FlaskGender Classification using SVM With Flask
Gender Classification using SVM With Flask
 
let's dive to deep learning
let's dive to deep learninglet's dive to deep learning
let's dive to deep learning
 
Speech Processing with deep learning
Speech Processing  with deep learningSpeech Processing  with deep learning
Speech Processing with deep learning
 
Image recognition
Image recognitionImage recognition
Image recognition
 
Object tracking
Object trackingObject tracking
Object tracking
 
Efficient mobilenet architecture_as_image_recognit
Efficient mobilenet architecture_as_image_recognitEfficient mobilenet architecture_as_image_recognit
Efficient mobilenet architecture_as_image_recognit
 
Convolutional Neural Network
Convolutional Neural NetworkConvolutional Neural Network
Convolutional Neural Network
 
IMAGE COMPOSITE DETECTION USING CUSTOMIZED
IMAGE COMPOSITE DETECTION USING CUSTOMIZEDIMAGE COMPOSITE DETECTION USING CUSTOMIZED
IMAGE COMPOSITE DETECTION USING CUSTOMIZED
 
Gesture Recognition Review: A Survey of Various Gesture Recognition Algorithms
Gesture Recognition Review: A Survey of Various Gesture Recognition AlgorithmsGesture Recognition Review: A Survey of Various Gesture Recognition Algorithms
Gesture Recognition Review: A Survey of Various Gesture Recognition Algorithms
 
Facial Image Analysis for age and gender and
Facial Image Analysis for age and gender andFacial Image Analysis for age and gender and
Facial Image Analysis for age and gender and
 
Object Recognition
Object RecognitionObject Recognition
Object Recognition
 
Movie Sentiment Analysis using Deep Learning RNN
Movie Sentiment Analysis using Deep Learning RNNMovie Sentiment Analysis using Deep Learning RNN
Movie Sentiment Analysis using Deep Learning RNN
 
IRJET- Machine Learning based Object Identification System using Python
IRJET- Machine Learning based Object Identification System using PythonIRJET- Machine Learning based Object Identification System using Python
IRJET- Machine Learning based Object Identification System using Python
 
deep learning
deep learningdeep learning
deep learning
 
Scrdet++ analysis
Scrdet++ analysisScrdet++ analysis
Scrdet++ analysis
 
Video surveillance Moving object detection& tracking Chapter 1
Video surveillance Moving object detection& tracking Chapter 1 Video surveillance Moving object detection& tracking Chapter 1
Video surveillance Moving object detection& tracking Chapter 1
 
HUMAN MOTION DETECTION AND TRACKING FOR VIDEO SURVEILLANCE
HUMAN MOTION  DETECTION AND TRACKING FOR VIDEO SURVEILLANCEHUMAN MOTION  DETECTION AND TRACKING FOR VIDEO SURVEILLANCE
HUMAN MOTION DETECTION AND TRACKING FOR VIDEO SURVEILLANCE
 
Deep Learning
Deep LearningDeep Learning
Deep Learning
 

Ähnlich wie Convolutional Neural Networks

CONVOLUTIONAL NEURAL NETWORK BASED FEATURE EXTRACTION FOR IRIS RECOGNITION
CONVOLUTIONAL NEURAL NETWORK BASED FEATURE EXTRACTION FOR IRIS RECOGNITION CONVOLUTIONAL NEURAL NETWORK BASED FEATURE EXTRACTION FOR IRIS RECOGNITION
CONVOLUTIONAL NEURAL NETWORK BASED FEATURE EXTRACTION FOR IRIS RECOGNITION AIRCC Publishing Corporation
 
CONVOLUTIONAL NEURAL NETWORK BASED FEATURE EXTRACTION FOR IRIS RECOGNITION
CONVOLUTIONAL NEURAL NETWORK BASED FEATURE EXTRACTION FOR IRIS RECOGNITION CONVOLUTIONAL NEURAL NETWORK BASED FEATURE EXTRACTION FOR IRIS RECOGNITION
CONVOLUTIONAL NEURAL NETWORK BASED FEATURE EXTRACTION FOR IRIS RECOGNITION ijcsit
 
PADDY CROP DISEASE DETECTION USING SVM AND CNN ALGORITHM
PADDY CROP DISEASE DETECTION USING SVM AND CNN ALGORITHMPADDY CROP DISEASE DETECTION USING SVM AND CNN ALGORITHM
PADDY CROP DISEASE DETECTION USING SVM AND CNN ALGORITHMIRJET Journal
 
Deep Neural Network DNN.docx
Deep Neural Network DNN.docxDeep Neural Network DNN.docx
Deep Neural Network DNN.docxjaffarbikat
 
Lung Cancer Detection using transfer learning.pptx.pdf
Lung Cancer Detection using transfer learning.pptx.pdfLung Cancer Detection using transfer learning.pptx.pdf
Lung Cancer Detection using transfer learning.pptx.pdfjagan477830
 
Image Classification And Skin cancer detection
Image Classification And Skin cancer detectionImage Classification And Skin cancer detection
Image Classification And Skin cancer detectionEman Othman
 
Deep learning for pose-invariant face detection in unconstrained environment
Deep learning for pose-invariant face detection in unconstrained environmentDeep learning for pose-invariant face detection in unconstrained environment
Deep learning for pose-invariant face detection in unconstrained environmentIJECEIAES
 
IRJET-Breast Cancer Detection using Convolution Neural Network
IRJET-Breast Cancer Detection using Convolution Neural NetworkIRJET-Breast Cancer Detection using Convolution Neural Network
IRJET-Breast Cancer Detection using Convolution Neural NetworkIRJET Journal
 
Hand Written Digit Classification
Hand Written Digit ClassificationHand Written Digit Classification
Hand Written Digit Classificationijtsrd
 
Face Recognition Based Intelligent Door Control System
Face Recognition Based Intelligent Door Control SystemFace Recognition Based Intelligent Door Control System
Face Recognition Based Intelligent Door Control Systemijtsrd
 
IRJET- Automated Detection of Gender from Face Images
IRJET-  	  Automated Detection of Gender from Face ImagesIRJET-  	  Automated Detection of Gender from Face Images
IRJET- Automated Detection of Gender from Face ImagesIRJET Journal
 
Convolutional Neural Network and Its Applications
Convolutional Neural Network and Its ApplicationsConvolutional Neural Network and Its Applications
Convolutional Neural Network and Its ApplicationsKasun Chinthaka Piyarathna
 
Let_s_Dive_to_Deep_Learning.pptx
Let_s_Dive_to_Deep_Learning.pptxLet_s_Dive_to_Deep_Learning.pptx
Let_s_Dive_to_Deep_Learning.pptxMohamed Essam
 
Brain Tumor Detection Using Deep Learning ppt new made.pptx
Brain Tumor Detection Using Deep Learning ppt new made.pptxBrain Tumor Detection Using Deep Learning ppt new made.pptx
Brain Tumor Detection Using Deep Learning ppt new made.pptxvikyt2211
 
Designing a neural network architecture for image recognition
Designing a neural network architecture for image recognitionDesigning a neural network architecture for image recognition
Designing a neural network architecture for image recognitionShandukaniVhulondo
 

Ähnlich wie Convolutional Neural Networks (20)

asd.pptx
asd.pptxasd.pptx
asd.pptx
 
CONVOLUTIONAL NEURAL NETWORK BASED FEATURE EXTRACTION FOR IRIS RECOGNITION
CONVOLUTIONAL NEURAL NETWORK BASED FEATURE EXTRACTION FOR IRIS RECOGNITION CONVOLUTIONAL NEURAL NETWORK BASED FEATURE EXTRACTION FOR IRIS RECOGNITION
CONVOLUTIONAL NEURAL NETWORK BASED FEATURE EXTRACTION FOR IRIS RECOGNITION
 
CONVOLUTIONAL NEURAL NETWORK BASED FEATURE EXTRACTION FOR IRIS RECOGNITION
CONVOLUTIONAL NEURAL NETWORK BASED FEATURE EXTRACTION FOR IRIS RECOGNITION CONVOLUTIONAL NEURAL NETWORK BASED FEATURE EXTRACTION FOR IRIS RECOGNITION
CONVOLUTIONAL NEURAL NETWORK BASED FEATURE EXTRACTION FOR IRIS RECOGNITION
 
PADDY CROP DISEASE DETECTION USING SVM AND CNN ALGORITHM
PADDY CROP DISEASE DETECTION USING SVM AND CNN ALGORITHMPADDY CROP DISEASE DETECTION USING SVM AND CNN ALGORITHM
PADDY CROP DISEASE DETECTION USING SVM AND CNN ALGORITHM
 
Deep Neural Network DNN.docx
Deep Neural Network DNN.docxDeep Neural Network DNN.docx
Deep Neural Network DNN.docx
 
Lung Cancer Detection using transfer learning.pptx.pdf
Lung Cancer Detection using transfer learning.pptx.pdfLung Cancer Detection using transfer learning.pptx.pdf
Lung Cancer Detection using transfer learning.pptx.pdf
 
Image Classification And Skin cancer detection
Image Classification And Skin cancer detectionImage Classification And Skin cancer detection
Image Classification And Skin cancer detection
 
Deep learning for pose-invariant face detection in unconstrained environment
Deep learning for pose-invariant face detection in unconstrained environmentDeep learning for pose-invariant face detection in unconstrained environment
Deep learning for pose-invariant face detection in unconstrained environment
 
IRJET-Breast Cancer Detection using Convolution Neural Network
IRJET-Breast Cancer Detection using Convolution Neural NetworkIRJET-Breast Cancer Detection using Convolution Neural Network
IRJET-Breast Cancer Detection using Convolution Neural Network
 
Hand Written Digit Classification
Hand Written Digit ClassificationHand Written Digit Classification
Hand Written Digit Classification
 
Cnn
CnnCnn
Cnn
 
Deep learning and computer vision
Deep learning and computer visionDeep learning and computer vision
Deep learning and computer vision
 
Face Recognition Based Intelligent Door Control System
Face Recognition Based Intelligent Door Control SystemFace Recognition Based Intelligent Door Control System
Face Recognition Based Intelligent Door Control System
 
IRJET- Automated Detection of Gender from Face Images
IRJET-  	  Automated Detection of Gender from Face ImagesIRJET-  	  Automated Detection of Gender from Face Images
IRJET- Automated Detection of Gender from Face Images
 
Convolutional Neural Network and Its Applications
Convolutional Neural Network and Its ApplicationsConvolutional Neural Network and Its Applications
Convolutional Neural Network and Its Applications
 
DL.pdf
DL.pdfDL.pdf
DL.pdf
 
Let_s_Dive_to_Deep_Learning.pptx
Let_s_Dive_to_Deep_Learning.pptxLet_s_Dive_to_Deep_Learning.pptx
Let_s_Dive_to_Deep_Learning.pptx
 
Som paper1.doc
Som paper1.docSom paper1.doc
Som paper1.doc
 
Brain Tumor Detection Using Deep Learning ppt new made.pptx
Brain Tumor Detection Using Deep Learning ppt new made.pptxBrain Tumor Detection Using Deep Learning ppt new made.pptx
Brain Tumor Detection Using Deep Learning ppt new made.pptx
 
Designing a neural network architecture for image recognition
Designing a neural network architecture for image recognitionDesigning a neural network architecture for image recognition
Designing a neural network architecture for image recognition
 

Kürzlich hochgeladen

The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 

Kürzlich hochgeladen (20)

The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 

Convolutional Neural Networks

  • 1. 1 Convolutional Neural Networks And Facial Recognition By Taylee Gray May 16th, 2019 Towson University MATH 490
  • 2. 2 Table of Contents Table of Contents 2 Introduction 3 Example Applications 4 Inputs, Labels, Outputs 5 A Description of the Model Function 5 Description of Stochastic Gradient Descent 7 Description of Backpropagation 8 How Pooling Affects Complexity 9 Demonstration 10 References 12
  • 3. 3 Introduction The Olivetti dataset from AT&T Laboratories of Cambridge University is the dataset that will help us conclude the importance of Convolutional Neural Networks (ConvNets or CNNs) in facial recognition. This dataset portrays ten different portraits of the same person with a distinct set of forty individuals. With this dataset we will be showing how CNNs are most effective in image recognition and classification. For starters, an Artificial Neural Network (ANN) is a computational model that is inspired by the way biological neural networks in the human brain process information (Ujjwalkarn). When referring to a neural network, the basic unit of computation is the neuron. These neurons are usually called nodes. A single neuron receives input from other nodes to which the neural network then creates an output. All inputs have an associated weight to compute the weighted sum output. Determining the weights usually depends on the relative importance to other inputs on the assigned basis. Input 1→ →Output Input 2→ The neural network above computes the numerical input from 𝑥1 and 𝑥2 along with the corresponding weights as well as the input of 1 with weight 𝑏 (known as our bias) to get an output. Of the many activation functions, the sigmoid is the most common in a NN. The sigmoid function takes a real-valued input and fits that input into a range between zero and one. As for a CNN, typically the ReLU function is more prominent. Convolutional Neural Networks are a category of neural networks. These networks get more in depth hence the concept of deep learning. Deep neural networks are powerful algorithms often much harder to train than shallow. In 1994, one of the first ConvNets was pioneered by Yann LeCun (Ujjwalkarn). This propelled the field of deep learning and was later named LeNet5. LeNet5 was architecture by a sequence of three layers: ● 32𝑥32 input layer Convolutions and subsampling layers consisting of: ● 6 different 28𝑥28 feature maps ● 6 different 14𝑥14 feature maps ● 16 different 10𝑥10 feature maps ● 16 different 5𝑥5 feature maps Fully connected 2D layer ● 120 outputs to 84 outputs to the final output of 10 f(w1x1+w2x2+b) 1 x1 x2 Y b w 1 w 2
  • 4. 4 Without a visual it is still clear to see that the original input is the largest which is then decomposed by pooling and nonlinearity to extract specific features eventually giving us the optimal output. Example Applications Time and technology go hand in hand and as they progress we are thrown with both positives and negatives. CNNs can be used for the greater good for instance, by automatically detecting cancer in endoscopic images. Studies by Yuma Endo and other engineers at AI Medical Service, Inc. in Japan presented a CNN that was trained using 13,584 endoscopic images of gastric cancer (Hirasawa). In order to improve the accuracy of the CNN they also trained an independent test set of 2296 stomach images. From these images, 77 gastric cancer lesions were applied to the CNN from 69 consecutive patients. As a result, the CNN correctly diagnosed 92% of its patients so we can conclude that this could be well applicable to the medical field reducing the burden of endoscopists. In general, the medical field is always advancing and what is popular at the moment is how CNNs are used to dispense medication based on a facial scan, Facial recognition is used worldwide in our everyday lives. It has been successful in catching criminals by the use of surveillance cameras. Any time a person goes missing there is a limited period of time to find them before the odds reduce significantly. As previously stated, the use of surveillance cameras in these situations are positively effective when combined with facial detection software like CNNs. Apple made its huge debut with the iPhone X having facial recognition unlock the device itself. Also, pinpointing terrorists by the use of facial recognition helps minimize the possibility of attacks; the list goes on and on. However, on the opposite spectrum of things, is facial recognition ethical? Despite the many positives of facial recognition, CNNs invoke plenty criticism regarding the legality and ethic standpoint. Facebook currently faces a lawsuit over its own facial recognition technology called, DeepFace. This technology identified people in photos without their consent. Amazon's smart home company Ring also came under fire for the same violation of civil rights. As of May 15th, 2019, San Francisco banned facial recognition technology making it the first city in the United States to have such a restriction. The disadvantage of these networks is simply the bias, bias referring to people being falsely accused and recognized. This bias also references to people of color and inaccurately exposes them. According to CBS News, twenty- eight members of congress falsely matched up with mugshots of criminals. The ban will not apply to federal use due to security reasons. All in all, it is hard to depict a line between the good and bad in ethics when it comes to facial recognition in CNNs.
  • 5. 5 Inputs, Labels, Outputs Our input for our model will be a total of 400 images. There are 40 people in our dataset each with 10 pictures of them. The photos are of size 64x64 and grayscale so that we don’t have to use a three dimensional input for our model. The images do not show large groups of people or whole shots of a person’s body. They just contain the subject’s face with no background showing. The labels for the model are the names of each person in our subject group, or more precisely a number associated with that person. The output layer of the model is represented by 40 nodes. Each node represents a person and their likelihood of the input image being that person. We do this by using the softmax activation function. The softmax function takes an input vector and normalizes it into a probability distribution consisting of 40 probabilities. The softmax function is given by: (Where is the sigmoid and is the kronecker delta function) Using the softmax we can create a probability distribution across all output nodes of the likelihood of the image being associated with it. The node with the largest probability ends up being the node that the Neural Network “selects”. Therefore, it selects one out of 40 names to be the label. A Description of the Model Function There are four primary portions of a Convolutional Neural Network that differentiates it from a traditional Fully Connected Neural Network. CNNs include Convolutional Layers, and Pooling Layers. These layers are introduced in the beginning of the network so that the image that is being used as input does not have to be used as is in the Fully Connected Layer. Typically (though not exclusively) CNNs use the ReLU activation function. They also use something called Dropout to protect from overfitting, and a Flattening Function to connect the preprocessing to a traditional NN. CNNs are useful because it is unwise to use a raw image as the first layer of a traditional NN due to The Curse of Dimensionality. The Curse of Dimensionality states that as the dimensions of your input increases, so should the number of data points (exponentially) that way you can “fill” the feature space adequately (Shetty, 2019). In the case of facial recognition an image tends to be very large, and therefore the feature space tends to be extremely large. A pixel image creates a dimension space and three times that if you have a colored image (one dimension for each color channel in the image). If it is not feasible to get more data points the only option is to implement a kind of preprocessing to decrease complexity. This is the basis of CNNs.
The first step of a CNN is its namesake, the Convolutional Layer. The purpose of convolution is to extract features from the image while maintaining a kind of spatial relationship between the pixels in the image. To convolve an image, a "filter" or "kernel" is used. Call it $K$. The filter is a matrix of some predetermined size $f \times f$ that is smaller than the input image, which has size $n \times n$. The filter also has a "stride" $s$, the number of pixels moved per step, either by row or by column. The filter's entries are randomly initialized numbers that will later be learned via backpropagation. Lastly, there is the number of filters used in a layer, $n_f$. The first step of convolution is to place the top left of your filter at the top left of your image $I$. Call the portion of $I$ that $K$ overlaps $P$. Then compute the sum of the element-wise product of the two overlapping matrices. In other words:

$\sum_{i=1}^{f} \sum_{j=1}^{f} K_{ij} P_{ij}$

This is the $(1, 1)$ element of the convolved feature. Next, take a single stride to the right and repeat this process. Striding right increments the column of the convolved feature. When your filter hits the right edge of your input image, reset the filter to the beginning of the row and take a stride down, incrementing the row of the convolved feature. Continue until the bottom of the image is reached. If you have a three-dimensional feature set (a colored image), you should repeat this process for each slice of the third dimension. This process results in a convolved feature of size:

$\left(\left\lfloor\frac{n-f}{s}\right\rfloor + 1\right) \times \left(\left\lfloor\frac{n-f}{s}\right\rfloor + 1\right)$

After a convolutional layer, it is convention to use the Rectified Linear Unit (or ReLU) as our activation function. ReLU is defined as $R(x) = \max(0, x)$. Previously, we had been using the Sigmoid function as our activation function. The reason ReLU is far more commonly used is that Sigmoid suffers from something called the vanishing gradient. Because $\sigma(x) = \frac{1}{1+e^{-x}}$, it can be shown that $\sigma'(x) = \sigma(x)(1-\sigma(x)) \leq \frac{1}{4}$. Repeatedly multiplying small numbers over multiple layers will result in a smaller and smaller gradient that eventually yields no change to your model. On the other hand, using ReLU, the gradient is zero if $x < 0$ and one if $x > 0$. (Because ReLU is not differentiable at $x = 0$, the derivative there is taken to be zero in practice.) Because of this, products of ReLU gradients neither explode nor vanish, so it makes a good activation function for a model with many layers like a CNN.
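To make the mechanics concrete, here is a minimal NumPy sketch of a single-filter, no-padding convolution followed by ReLU. The function name and the square image and filter are our own assumptions for illustration, not part of any library.

    import numpy as np

    def convolve2d(image, kernel, stride=1):
        """Valid (no-padding) convolution of an n x n image with an f x f filter."""
        n, f = image.shape[0], kernel.shape[0]
        out = (n - f) // stride + 1          # floor((n - f) / s) + 1, as above
        feature = np.zeros((out, out))
        for r in range(out):                 # striding down increments the row
            for c in range(out):             # striding right increments the column
                patch = image[r*stride:r*stride + f, c*stride:c*stride + f]
                feature[r, c] = np.sum(kernel * patch)  # element-wise product, summed
        return feature

    image = np.random.rand(64, 64)           # one grayscale Olivetti-sized input
    kernel = np.random.randn(3, 3)           # randomly initialized 3x3 filter
    conv = np.maximum(convolve2d(image, kernel), 0)   # ReLU: R(x) = max(0, x)
    print(conv.shape)                        # (62, 62) = (64 - 3)//1 + 1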
The Pooling Layer acts similarly to the Convolutional Layer in that it reduces the dimensionality of the input while trying to preserve the important information in the feature map. By reducing the dimensionality of the input, the network is more easily computable, better protected against overfitting, and less likely to be affected by tiny transformations and distortions of the input. Like the convolutional filter, pooling has a filter as well, with a size $f \times f$ and a stride $s$, but unlike a convolutional filter it has no parameters to be changed by backpropagation. Instead, it uses a fixed function to get its output number. The most popular and effective option is max-pooling, where each element of the output matrix is the largest number inside the pooling filter. Sum and average functions can also be used.

While pooling can help fight overfitting, it is not always successful. That is where we can implement dropout. Dropout is a "regularization method that approximates training a large number of neural networks with different architectures in parallel" (Brownlee, 2019). To do this, while training the network a random selection of neurons is ignored (dropped out), meaning all of their incoming and outgoing connections are ignored. This introduces noise and error into the training process, which helps keep any one neuron from acquiring weights that are too large; very large weights are a sign that your data has been overfit. By thinning out the network, dropout may also require more neurons to maintain the same number of activated neurons in the network. This process keeps our model from overfitting.

The last step of a CNN is the flattening. Flattening simply takes the output of our convolution and pooling layers and transforms it into a vector to be taken in by our fully connected layer. This can be done by reading the matrix left-to-right, top-to-bottom and filling in a one-dimensional vector element by element. The result is a vector whose length is the total number of entries in the final feature maps, which can be used as the input of our Fully Connected Layer. Once we have this input vector we can operate our NN as a typical multi-layer perceptron. We are assuming you have a background in these kinds of NNs, so we will not detail their structure here.
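Continuing the sketch from the previous section, here is a minimal max-pooling pass followed by flattening; the names are ours, and the other pooling variants are one-line swaps.

    import numpy as np

    def max_pool(feature, f=2, stride=2):
        """Each output element is the largest value under the pooling filter."""
        n = feature.shape[0]
        out = (n - f) // stride + 1
        pooled = np.zeros((out, out))
        for r in range(out):
            for c in range(out):
                window = feature[r*stride:r*stride + f, c*stride:c*stride + f]
                pooled[r, c] = window.max()   # use window.mean() or window.sum()
        return pooled                         # for average or sum pooling

    feature = np.random.rand(62, 62)          # e.g. the convolved feature from before
    flat = max_pool(feature).reshape(-1)      # flatten left-to-right, top-to-bottom
    print(flat.shape)                         # (961,) = 31 * 31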
Description of Stochastic Gradient Descent

In the CNN training process, the gradient descent optimization algorithm aims to minimize some cost/loss function based on that function's gradient. Successive iterations are employed to progressively approach either a local or global minimum of the cost function. Recall that the gradient of a function $f(x_1, \dots, x_n)$ is given by:

$\nabla f = \left(\frac{\partial f}{\partial x_1}, \dots, \frac{\partial f}{\partial x_n}\right)$

The gradient can be intuitively thought of as the path of descent a ball would take while rolling down a hill. However, using neural networks we do not have access to this function; otherwise we would be done! There would be no reason to train a model. Instead we have access to a loss function which can help us approximate it. Our loss function is described as the sum of squared differences between what your model gave you and what your model should have given you:

$L(w) = \frac{1}{N}\sum_{i=1}^{N}\left(\hat{y}_i - y_i\right)^2$

(This is the Mean Squared Error, or MSE.) Using our loss function we can approximate GD using a subset of our training data. This is the difference between Stochastic Gradient Descent and Gradient Descent. The iterative process for SGD is as follows:

● Start at some randomly selected parameter vector $w_0$.
● For each step $t$, have some process to generate $S_t$, a subset of our training set.
● Then compute $\nabla L_{S_t}(w_t)$ so that we can use the formula to update our vector: $w_{t+1} = w_t - \eta \nabla L_{S_t}(w_t)$
● Here $\eta$ is our predetermined learning rate.

Repeating this process, we approach a local minimum of our loss function. Our hope is that this local minimum is actually the global minimum, so that our model is fully optimized, but this is not guaranteed. Also, because SGD applies only a subset of the training data at each step, each step is cheaper than in GD, so it gets near the minimum much faster but probably will not converge exactly to the minimum. Instead, it will oscillate around the minimum, and the approximation is good enough (Ng).
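Here is a minimal sketch of this loop on a toy one-parameter problem; the function names and the synthetic data are ours and purely illustrative.

    import numpy as np

    def sgd(w0, grad_loss, data, eta=0.1, batch_size=10, steps=1000):
        """w_{t+1} = w_t - eta * gradient of the loss on a random mini-batch S_t."""
        w = w0
        for t in range(steps):
            idx = np.random.choice(len(data), batch_size, replace=False)
            w = w - eta * grad_loss(w, data[idx])   # the SGD update rule
        return w

    # Toy problem: fit y = 2x. For MSE, the gradient w.r.t. w is 2*X^T(Xw - y)/N.
    X = np.random.rand(100, 1)
    y = 2.0 * X[:, 0]
    data = np.hstack([X, y[:, None]])
    grad = lambda w, S: 2.0 * S[:, :1].T @ (S[:, :1] @ w - S[:, 1]) / len(S)
    print(sgd(np.zeros(1), grad, data))   # oscillates near [2.]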
Description of Backpropagation

Backpropagation is how a neural network learns: it looks at how much error or cost was calculated and tries to minimize that number. For CNNs we look at the error between each of our layers and decide whether something matches in that layer and, if so, how well it matches. From there the neural network decides what or who it is looking at. For example, the cost of a single neuron in a network can be given as:

$C(W) = C(R(Z(W)))$

where $C$ is the cost function (for instance the squared error $C = (\hat{y} - y)^2$), $R$ is the ReLU, $Z$ is our weighted input ($Z = XW$), and $W$ is our weight. Taking the partial derivative of our error function with respect to the weight we are looking at will show us how much that weight contributes to the error, and as such we need to expand our formula using the chain rule:

$\frac{\partial C}{\partial W} = \frac{\partial C}{\partial R} \cdot \frac{\partial R}{\partial Z} \cdot \frac{\partial Z}{\partial W}$

These partial derivatives are used to check each parameter and how that parameter contributes to the total change in error. Furthermore, as we go through multiple layers our cost function gains more and more inputs, so the expansion can get rather lengthy and cumbersome. However, as we go through more weights we reuse the previously calculated terms when applying the chain rule, so we have the program remember those values and avoid recalculating them. With the information gathered we can now find the actual error of each layer and see how it impacts our final layer, which is our result. That is, taking the derivative of our cost function with respect to the weighted input of the output layer ($Z_o$) and hidden layers ($Z_h$):

$E_o = (\hat{y} - y) \cdot R'(Z_o)$
$E_h = E_o \cdot W_o \cdot R'(Z_h)$

We can see that the hidden layer error is the output-layer error times our output weights times the derivative of the ReLU at our hidden layer. This is the general process of backpropagation: we keep shuffling our weighted error back to the previous layer and continue filtering our output until we arrive at an error small enough for the machine to make an informed decision on what the image is and to put a name to it. How the machine decides to adjust the weights is derived from the SGD described above; depending on the sign and magnitude of each partial derivative, the weight is adjusted accordingly to give us the lowest error possible.

How Pooling Affects Complexity

Pooling is an important layer in Convolutional Neural Networks because it decreases the complexity of your model. By downsampling our input layer by layer, we can protect against overfitting and against minor variations in our data that would otherwise look like a completely new image to a naive network. We also decrease the dimensionality of the network, which improves manageability and runtime.

There are several different kinds of functions that can be used in your Pooling Layer, and they all perform differently: Sum, Average, and Max pooling. In a network it is not uncommon to use a combination of these. Sum and Average have similar results, just scaled differently, so the comparison to be had is between Average and Max. Max pooling takes the largest value over the filter, while Average takes the mean. The intuition here is that Max takes the "most important" feature, while Average takes into consideration all of the values within the filter. According to Karpathy in Convolutional Neural Networks for Visual Recognition, the most effective pooling function is Max pooling, due to its ability to extract the valuable information from the image while being less affected by small variations.
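A tiny numerical illustration of that intuition, with values chosen by us:

    import numpy as np

    window = np.array([[0.1, 0.2],
                       [0.1, 9.0]])   # a 2x2 pooling window with one strong feature

    print(window.max())    # 9.0  -> max pooling keeps the strongest activation
    print(window.mean())   # 2.35 -> average pooling dilutes it across the window
    print(window.sum())    # 9.4  -> sum pooling: the average scaled by window size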
However, if a Pooling Layer has the same structure as a Convolutional Layer, minus the parameters and plus a predetermined function, would it be feasible to replace it with another Convolutional Layer in an attempt to get the best of both worlds? The Convolutional Layer can still downsample the input features, but it can use SGD to learn filters that would hopefully extract more information from the image. In Springenberg et al.'s Striving for Simplicity: The All Convolutional Net, they explored this possibility in hopes of finding a simpler CNN model. They used the CIFAR-10 dataset (Krizhevsky & Hinton, 2009) to study different models. The first model was the control: a typical CNN with Pooling Layers. The second model was an All-CNN with an incremented stride on the Convolutional Layers that precede the removed Pooling Layers. It is important to increment the stride because the next layer should accept an input covering the same spatial region as it did in the control model. The final model replaced each Pooling Layer with a Convolutional Layer. After comparing the models, the results showed an increase in accuracy with the Pooling Layers replaced, and a lower loss as well. Our expectation was that a network with no Pooling Layers would perform poorly due to overfitting of the data, which is the main advantage of using Pooling Layers in the first place. In order to test Springenberg's conclusions, we ran tests of our own on a relatively simply structured network, with anti-overfitting countermeasures that were not overly aggressive, to see whether these All Convolutional Networks are prone to overfitting.

Demonstration

To demonstrate the effect of removing pooling from a Convolutional Neural Network, we started by building a typical CNN. We built the CNN using Keras, a high-level Python library built on Google's TensorFlow for machine learning. Our CNN started with ten pictures each of forty different people from AT&T's Olivetti dataset. It is important to note that the images we used had been preprocessed to contain only the face of the person being photographed. We had tried using the Labeled Faces in the Wild dataset and found that the images were not cropped closely enough to the subject's face and included too much of the background to create a successful model. There are ways to process those faces into usable inputs, but we opted to change datasets for simplicity. The structure of our CNN was as follows (a minimal Keras sketch of this structure appears after the list):

● Convolutional Layer with 64 filters of size 3 x 3, stride of 1, with the ReLU activation function
● Max Pooling Layer with filter size 2 x 2, stride of 2
● Another Convolutional Layer
● Another Max Pooling Layer
● A 10% Dropout Layer
● The Flattening Function
● A length-64 vector for the first layer of our Fully Connected Network
● A 20% Dropout Layer
● Our output layer with 40 nodes using the softmax activation function
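Here is a minimal sketch of that structure in Keras, assuming the 2019-era API; the learning rate and the cross-entropy loss are our placeholder assumptions (only the momentum and decay settings, described in the next paragraph, are recorded in this report).

    from tensorflow import keras
    from tensorflow.keras import layers

    # Baseline CNN mirroring the list above (input: 64x64 grayscale, 40 classes).
    model = keras.Sequential([
        layers.Conv2D(64, (3, 3), strides=1, activation="relu",
                      input_shape=(64, 64, 1)),
        layers.MaxPooling2D(pool_size=(2, 2), strides=2),
        layers.Conv2D(64, (3, 3), strides=1, activation="relu"),
        layers.MaxPooling2D(pool_size=(2, 2), strides=2),
        layers.Dropout(0.10),
        layers.Flatten(),
        layers.Dense(64, activation="relu"),
        layers.Dropout(0.20),
        layers.Dense(40, activation="softmax"),
    ])

    # lr=0.01 and the loss are placeholders; the report records only
    # momentum=1 and decay=0.05 for its SGD optimizer.
    model.compile(optimizer=keras.optimizers.SGD(lr=0.01, momentum=1.0, decay=0.05),
                  loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    # model.fit(x_train, y_train, batch_size=10, epochs=100,
    #           validation_data=(x_test, y_test))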
After we make the structure of the network, the next step is to define the optimizer and its parameters. As mentioned before, our optimizer was Stochastic Gradient Descent with our chosen learning rate, along with a few other tweaked parameters (momentum=1, decay=0.05). After defining the input, structure, and optimizer, the last thing to do is run the network. We used a batch size of 10 and ran through around 100 epochs each time.

[Graphs of accuracy and loss for our baseline CNN using SGD and Max Pooling Layers.]

After 100 epochs we earned a validation accuracy of almost 80% and a validation loss of near 1. By looking at the graphs we can see that the scores on training and testing data are fairly close. This is a good indication that we have not overfit our data. The next step was to remove our pooling layers and see how the model fared. In order to remove the pooling layers, there are a few steps you have to take to maintain the model's integrity. According to Springenberg, the way to convert your traditional CNN to an all-convolutional one is either to change the Pooling Layer into a Convolutional Layer using the same size and stride the Pooling Layer used, or to remove the Pooling Layer and increase the stride of the preceding Convolutional Layer by one so the network can maintain the same feature size (both conversions are sketched below). Here were the results after 200 epochs.

[Graphs of accuracy and loss for the all-convolutional model after 200 epochs.]
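For reference, the two conversions just described, sketched as Keras layer blocks with the sizes from our baseline (names are ours):

    from tensorflow.keras import layers

    # Baseline downsampling block: stride-1 convolution followed by max pooling.
    baseline = [layers.Conv2D(64, (3, 3), strides=1, activation="relu"),
                layers.MaxPooling2D(pool_size=(2, 2), strides=2)]

    # Conversion 1: the pooling layer becomes a convolution with the same
    # size and stride (the "All-CNN" variant).
    replaced = [layers.Conv2D(64, (3, 3), strides=1, activation="relu"),
                layers.Conv2D(64, (2, 2), strides=2, activation="relu")]

    # Conversion 2: drop the pooling layer and increment the stride of the
    # preceding convolution instead (the "Strided-CNN" variant).
    strided = [layers.Conv2D(64, (3, 3), strides=2, activation="relu")]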
Our training data without pooling did perform better than with pooling, with an accuracy of almost 90% and a loss below 0.5. However, our testing data did not do well at all, with testing loss rising to over 2.0 at some points. This is a clear case of overfitting our data, which makes sense because part of the reason pooling is used is to prevent overfitting. While our loss did fare worse, there were some advantages to using a CNN with no pooling. While the model did learn more slowly, it ran drastically faster: the CNN with pooling took around 4 seconds per epoch, while the CNN without pooling took under half that time. For this reason, we believe that if you used more aggressive anti-overfitting techniques, such as a higher dropout percentage, then you could see some real benefit to using an All Convolutional Network.

References

Karpathy, A. "Convolutional Neural Networks for Visual Recognition." CS231n Convolutional Neural Networks for Visual Recognition, cs231n.github.io/convolutional-networks/.

"Backpropagation." Backpropagation - ML Documentation, ml-cheatsheet.readthedocs.io/en/latest/backpropagation.html.

Brownlee, Jason. A Gentle Introduction to Dropout for Regularizing Deep Neural Networks. 21 Apr. 2019, machinelearningmastery.com/dropout-for-regularizing-deep-neural-networks/. Accessed 13 May 2019.

CBS/AP. "San Francisco Bans Facial Recognition Technology." CBS News, CBS Interactive, 15 May 2019, www.cbsnews.com/news/san-francisco-becomes-first-us-city-to-ban-facial-recognition-technology-today-2019-05-14/.
Hirasawa, Toshiaki, et al. "Application of Artificial Intelligence Using a Convolutional Neural Network for Detecting Gastric Cancer in Endoscopic Images." SpringerLink, Springer Japan, 15 Jan. 2018, link.springer.com/article/10.1007/s10120-018-0793-2.

Krizhevsky, A., and G. Hinton. Learning Multiple Layers of Features from Tiny Images. 2009.

Ng, Andrew. Supervised Learning. Stanford University, cs229.stanford.edu/notes/cs229-notes1.pdf.

Shetty, Badreesh. Curse of Dimensionality. 15 Jan. 2019, towardsdatascience.com/curse-of-dimensionality-2092410f3d27. Accessed 13 May 2019.

Springenberg, Jost Tobias, et al. Striving for Simplicity: The All Convolutional Net. Department of Computer Science, University of Freiburg, 2015, arxiv.org/pdf/1412.6806.pdf.

Suárez-Paniagua, Víctor. "Evaluation of Pooling Operations in Convolutional Architectures for Drug-Drug Interaction Extraction." BMC Bioinformatics, BioMed Central, 13 June 2018, www.ncbi.nlm.nih.gov/pmc/articles/PMC5998766/.

Ujjwalkarn. "A Quick Introduction to Neural Networks." The Data Science Blog, 10 Aug. 2016, ujjwalkarn.me/2016/08/09/quick-intro-neural-networks/.