Image Segmentation and Classification for Fin Whale Identification
Michael Ford
Kathleen Moriarty
Elijah Willie
Luke McQuaid
December 10, 2016
1 Problem Description
In many fields of zoology, the ability to identify individual animals is foundational to research,
allowing scientists to study aspects of animal biology such as social dynamics and population
distribution. However, depending on the animal being studied, identification may be difficult or
involve significant manual labour. A primary example is the identification of fin whales, a species
of large cetacean that lives off the coast of British Columbia and whose individuals each carry
unique identifying markings. There are over 700 individual fin whales, and when an animal is
encountered its photo must be compared by hand to a reference catalogue. In this project we applied
various machine learning approaches with the goal of automating this process of fin whale
identification.
Figure 1: Example image of a fin whale indicating unique identification features.
1.1 Data set
In collaboration with the Cetacean Research Program, Fisheries and Oceans Canada, we received
a catalogue of 79 individuals with 884 images. This represents a small sample of the entire stored
catalogue of identification photos. However, the labels for the catalogue are recorded on hard copy,
and because of the significant labour involved in manually labelling photos we were restricted to
this data set.
1.2 Identifying Challenges
1.2.1 Data set size
Previous work in the area of automated whale identification has shown that this is a feasible problem.
In particular, the Kaggle Right Whale Recognition Contest of August 2015 [kag()] had private
submissions achieving a log-loss of 0.59600. However, the Kaggle data set contained 4237 photos
for 427 individuals, a significant difference in scale from our data set. We recognized from the
outset that the size of our data set would restrict our ability to train complex models because of
the high chance of over-fitting.
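For context on the metric quoted above, multiclass log-loss is the mean negative log of the probability a model assigns to each photo's true identity. The sketch below is only an illustration of that definition; the function name and the toy probabilities are assumptions and are not taken from the contest.

```python
import math

def multiclass_log_loss(true_labels, predicted_probs, eps=1e-15):
    """Mean negative log-probability assigned to the true class."""
    total = 0.0
    for label, probs in zip(true_labels, predicted_probs):
        # Clip so that a predicted probability of exactly 0 or 1 stays finite.
        p = min(max(probs[label], eps), 1.0 - eps)
        total += -math.log(p)
    return total / len(true_labels)

# Toy example with three individuals (hypothetical values).
labels = [0, 2]
probs = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]
loss = multiclass_log_loss(labels, probs)

# A uniform guess over 427 individuals scores log(427), roughly 6.06.
uniform_loss = multiclass_log_loss([0], [[1.0 / 427] * 427])
```

Under this definition, a score of 0.596 is far below the uniform-guess value, which is what makes the Kaggle result a useful feasibility signal.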
1.2.2 Signal-to-Noise
As can be seen in the figure below, many input photos have significant noise in the background.
Because of this poor signal-to-noise ratio, and because our small data set ruled out complex learned
feature extraction, we decided to devote significant effort to pre-processing steps prior to doing
any classification.
2 Pre-processing
2.1 Segmentation using Markov Random Fields
Over the past few decades, Markov Random Fields (MRFs) have become a widely used way of denoising
and segmenting many types of images. It was therefore natural for the pre-processing of the whale
images to include a pipeline built on MRFs. We used a previously implemented MRF model [Sharma()]
for processing the images. For our purposes, this model was used for image de-noising and for
segmentation by edge detection. The model exposes parameters that can be set by the user, which is
useful for optimization, as a user's needs may differ based on the context of use. For this model,
we needed to specify the maximum number of iterations, the value of k for the k-means clustering
used to classify each pixel in the image based on its neighbours, and a value for the potential
function used in the model.
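To make the roles of these three parameters concrete, the following is a minimal pure-Python sketch of MRF-style segmentation on a tiny grey-scale image. It assumes a Potts-style smoothness penalty and iterated conditional modes (ICM) updates initialised by k-means on pixel intensities; these choices, and all function names, are illustrative assumptions and may differ from the implementation in [Sharma()].

```python
def kmeans_1d(values, k, iters=10):
    """Cluster scalar intensities and return the k cluster means."""
    lo, hi = min(values), max(values)
    # Spread the initial means evenly across the intensity range.
    means = [lo + (hi - lo) * (i + 0.5) / k for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda c: (v - means[c]) ** 2)
            groups[nearest].append(v)
        # Keep the old mean when a cluster receives no pixels.
        means = [sum(g) / len(g) if g else means[c] for c, g in enumerate(groups)]
    return means

def mrf_segment(image, k=3, max_iter=5, potential=0.5):
    """Label each pixel by trading intensity fit against neighbour agreement."""
    h, w = len(image), len(image[0])
    means = kmeans_1d([v for row in image for v in row], k)
    labels = [[min(range(k), key=lambda c: (v - means[c]) ** 2) for v in row]
              for row in image]
    for _ in range(max_iter):
        changed = False
        for y in range(h):
            for x in range(w):
                neighbours = [labels[ny][nx]
                              for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1))
                              if 0 <= ny < h and 0 <= nx < w]

                def energy(c):
                    data = (image[y][x] - means[c]) ** 2
                    # Potts penalty: pay `potential` per disagreeing 4-neighbour.
                    smooth = potential * sum(1 for n in neighbours if n != c)
                    return data + smooth

                best = min(range(k), key=energy)
                if best != labels[y][x]:
                    labels[y][x] = best
                    changed = True
        if not changed:
            break  # converged before reaching max_iter
    return labels

# Toy image: dark left half, bright right half, one noisy pixel at row 1, column 1.
image = [
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.6, 1.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
]
labels = mrf_segment(image, k=2, max_iter=5, potential=0.5)
```

With potential = 0.5 the noisy pixel is absorbed into the dark region, whereas with potential = 0 the smoothness term vanishes and the pixel keeps the label of its nearest intensity cluster.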
This pipeline was very memory demanding and slow to run over all the samples, so great care was
taken when picking parameters. See Figure 2 for the result of MRF applied to an image. This image
was processed with the following parameters: maxIter = 5, k = 3, and potential = 0.5. The model with
these parameters took approximately 45 seconds to complete. Multiplying this by the total number of
images in our data set (838 images), we see that applying this model to all of the images would take
approximately 10.5 hours. These parameters proved to be the best compromise between computation time
and quality of results, as other parameters had longer computation times while yielding similar
results.
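The run-time estimate above is a simple product, reproduced here as a small helper. The function name is hypothetical; the inputs are the approximate per-image time and image count quoted in the text.

```python
def batch_hours(seconds_per_image, num_images):
    """Total wall-clock hours to process every image sequentially."""
    return seconds_per_image * num_images / 3600

mrf_hours = batch_hours(45, 838)  # approximate MRF time per image, image count
print(f"{mrf_hours:.1f} hours")
```

Because every additional parameter setting tried on the full data set costs another run of this length, only a handful of settings can realistically be compared.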
(a) Original image before MRF transformation
(b) Image after MRF transformation
Figure 2: MRF applied to test image
2.2 Segmentation using Hidden Markov Random Field-Expectation Maximization (HMRF-EM)
In addition to using a regular MRF for image denoising and segmentation by edge detection, we also
used a second model for the same tasks. This model is based on a Hidden Markov Random Field and uses
the Expectation Maximization algorithm to learn the most probable parameters. The algorithm is fully
described in [Wang(2012)]. This source also provided an implementation, which enabled us to tweak
the parameters until we were able to remove an adequate amount of noise from the original image.
This model was just as memory intensive and time consuming as the model based on a regular MRF, but
it had one fewer parameter: we had to specify the number of iterations and the number of neighbours,
k, used when classifying a pixel in the image. Great care was again taken to pick parameters that
left as little noise as possible in the image compared to the original image. Both of these models
(MRF and HMRF-EM) were used for image denoising, segmentation and generating extra features for
downstream training of the final model.
See Figure 3 for the result of HMRF-EM applied to an image. This image was processed with the
following parameters: maxIter = 5 and k = 3. The model with these parameters took approximately
80 seconds to complete. Again, multiplying this by the total number of images in the data set
(838 images), we see that applying this model to all of the images would take approximately
18.6 hours. Given this lengthy computation time, it is clear how necessary parameter optimization
was for this model. These parameters, however, proved to be the best compromise between computation
time and quality of results, as other parameters had longer computation times while yielding similar
results.
In addition to the availability of existing implementations, these models were chosen because they
are among the simplest statistical models for image denoising and segmentation that do not assume
the variables within a system are independent, the variables here being the pixels within an input
image. These models also allow us to capture important inter-pixel dependencies and conditional
dependencies that can be exploited for pre-processing.
(a) Original image before HMRF-EM transformation
(b) Image after HMRF-EM transformation
Figure 3: HMRF-EM applied to test image
2.3 Cropping
2.3.1 Manual Cropping
While developing a model to crop images automatically and pipeline them into our CNN identifier,
we still needed cropped images in order to work on the CNN itself. A small program was therefore
created to expedite the process of manually cropping all the photos. Images were first greatly
reduced in size using an augmenting program, and cropping was then done using code developed by
[Rosebrock()] for cropping boxes out of an image. The cropped regions were then mapped back to the
full-size images to obtain a high-resolution version of each cropped photo.
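Mapping a crop box from the reduced image back to the full-size image amounts to scaling its corner coordinates by the ratio of the two image sizes. The sketch below illustrates that step; the function name, the (x0, y0, x1, y1) box convention and the example sizes are assumptions rather than details of the program described above.

```python
def map_box_to_full_size(box, small_size, full_size):
    """Scale a crop box (x0, y0, x1, y1) from the reduced image to the original."""
    small_w, small_h = small_size
    full_w, full_h = full_size
    scale_x = full_w / small_w
    scale_y = full_h / small_h
    x0, y0, x1, y1 = box
    # Round to whole pixels and clamp to the bounds of the full-size image.
    left = max(0, min(full_w, round(x0 * scale_x)))
    top = max(0, min(full_h, round(y0 * scale_y)))
    right = max(0, min(full_w, round(x1 * scale_x)))
    bottom = max(0, min(full_h, round(y1 * scale_y)))
    return (left, top, right, bottom)

# Hypothetical example: a box drawn on a 400x300 preview of a 4000x3000 photo.
full_box = map_box_to_full_size((50, 20, 150, 80), (400, 300), (4000, 3000))
```

Drawing the box on a small preview keeps the manual step fast, while the scaled coordinates let the crop be taken from the original pixels.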
3 Whale Identification
3.1 Binary Classification using Pre-trained CNN
The initial approach was to create a binary classifier using two identical, merged CNN structures.
Siamese CNN structures have previously been applied with some success to facial recognition
[Khalil-Hani(2014)] and to one-shot image recognition [Koch(2015)]. This suggested that such a model
would be suitable for whale identification with our small data set. However, because of the small
data set size and limited available computation resources, we decided to implement the model using
a VGG16 structure pre-trained on the ImageNet database [Chollet(a)]. We removed the last VGG
classification block, merged two identical models with learning turned off, and then tried several
different classification structures, including adding one or two fully connected layers and a
dropout layer, before the single classification node, which used a sigmoid activation function.
Training was performed using full-frame images resized to 224x224. However, we were unable to obtain
any results that modelled more than the class proportions of the training examples. This led us to
conclude that either the signal-to-noise ratio was not adequate, or that the pre-trained VGG network
was not extracting features representative of the inter-individual variation.
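A paired binary classifier is trained on pairs of images labelled "same whale" or "different whale", and such pair sets are usually heavily skewed towards "different". The sketch below is illustrative only and is not the code used in this project; it assumes every unordered pair of images is enumerated, and the labels are hypothetical.

```python
from itertools import combinations

def make_pairs(labels):
    """Return (i, j, same) for every unordered pair of image indices."""
    return [(i, j, int(labels[i] == labels[j]))
            for i, j in combinations(range(len(labels)), 2)]

# Hypothetical catalogue: three whales with two photos each.
labels = ["whale_a", "whale_a", "whale_b", "whale_b", "whale_c", "whale_c"]
pairs = make_pairs(labels)
positive_fraction = sum(same for _, _, same in pairs) / len(pairs)
```

In this toy case only 3 of the 15 pairs are positive, so a classifier that always answers "different" already matches the class proportions, which is the failure mode a model falls into when its features carry no identity signal.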
3.2 Minimal Compute Method to Establish Baseline Results
3.2.1 Method
A model of our minimal compute method, outlined in Stages 1 and 2 below, can be seen in Fig. 4.
Figure 4: Model of Minimal Compute Method Pipeline
Stage 1: Feature Extraction
Training a state-of-the-art convolutional neural network (CNN) model for multiclass classification
would require more compute power than we had at our disposal. Instead, a "CPU friendly"
alternative method was used: whale images were fed through an Inception V3 CNN model pre-trained
on ImageNet images (similar to Assignment 3 [Mori(2016)]). Outputs from the last average pooling
layer were collected and saved as a feature vector for each whale image. Our hypothesis was that the
Inception V3 model would have learned enough discriminative information about each whale image to
make classification on top of these feature vectors possible.
Several variations of the original images were passed through the pre-trained CNN to obtain
different sets of feature vectors (as depicted in Fig. 4). These sets were: original full images,
cropped images, and cropped high-resolution images (which were not re-sized to the 299x299 V3 input
dimensions).
Stage 2: Support Vector Machine for Classification
The feature vectors of each image, as discussed in Stage 1, were then passed into a support vector
machine to classify each image as one of 78 whales.
Feature vectors were either left unnormalized or normalized by scaling each vector to unit norm
(i.e. L2-normalization). This has been shown to be an effective pre-processing step in previous
work using SVMs for classification problems [Simonyan and Zisserman(2015)].
With L2-normalization, for each element $E_x$ of feature vector $x$:
\[ E_x^{\mathrm{norm}} = \frac{E_x}{\lVert x \rVert_2} \qquad (1) \]
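Equation (1) can be written as a short function. This is a minimal sketch; the function name is hypothetical, and leaving an all-zero vector unchanged is an assumption made here to avoid division by zero.

```python
import math

def l2_normalize(vector):
    """Divide every element by the vector's Euclidean (L2) norm."""
    norm = math.sqrt(sum(v * v for v in vector))
    if norm == 0.0:
        return list(vector)  # assumption: leave all-zero vectors unchanged
    return [v / norm for v in vector]

normalized = l2_normalize([3.0, 4.0])
unit_length = math.sqrt(sum(v * v for v in normalized))
```

After normalization every feature vector has length one, so the SVM compares directions in feature space rather than overall activation magnitude.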
However, after visualizing these normalized feature vectors (see Fig. 5), there did not appear to
be enough variation between images that could be determined from their V3-learned features.
Figure 5: Normalized Feature Vectors Learned from Inception V3
3.2.2 Results
Several hyper-parameters were tested during cross-validation; the results are shown in Fig. 6.
Figure 6: Cross-Validation Accuracy Reported Across Several Hyper-parameters
The best choice of hyper-parameters, resulting in the highest validation accuracy, was as follows.
Surprisingly, despite L2-normalization outperforming in every other test, unnormalized features
resulted in higher accuracy on the validation set. The highest validation accuracy was also achieved
in conjunction with using a new feature vector, created from the feature vectors of full images,
cropped images and high-resolution cropped images:
\[ \mathrm{Features}_{\mathrm{best}} = \frac{\sum_{\mathrm{type}} F_{\mathrm{type}}}{\mathrm{numF}_{\mathrm{types}}} \qquad (2) \]
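Reading Equation (2) as an element-wise mean over the feature vectors of one image (an assumption based on the division by the number of feature types), the combination step can be sketched as follows. The function name and the toy three-element vectors are hypothetical.

```python
def average_features(feature_sets):
    """Element-wise mean of several equal-length feature vectors."""
    count = len(feature_sets)
    length = len(feature_sets[0])
    if any(len(f) != length for f in feature_sets):
        raise ValueError("all feature vectors must have the same length")
    return [sum(f[i] for f in feature_sets) / count for i in range(length)]

# Toy vectors standing in for the three feature types of a single image.
full = [0.2, 0.4, 0.0]
cropped = [0.4, 0.2, 0.3]
cropped_high_res = [0.6, 0.0, 0.3]
combined = average_features([full, cropped, cropped_high_res])
```

Averaging keeps the combined vector the same length as each input, so the same SVM configuration can be reused without changing the input dimension.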
These hyper-parameters were then used to create our final "CPU-friendly" model, achieving a
classification accuracy on our test set of 7.61 percent (see Fig. 7).
Figure 7: Test Accuracy, Using Chosen "Best" Hyper-parameters
The final model was also tested on whales which had more than 25 images included in the data
set, resulting in only 12 classes. Our final model achieved a test accuracy of 19 percent on this
simplified problem.
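To put these accuracies in context, a uniform random guess over N equally likely classes is correct with probability 1/N. The short calculation below assumes balanced classes, which this data set may not satisfy, so the ratios are indicative only; the function names are hypothetical.

```python
def chance_accuracy(num_classes):
    """Expected accuracy of a uniform random guess over balanced classes."""
    return 1.0 / num_classes

def lift_over_chance(reported_accuracy, num_classes):
    """How many times better than uniform guessing a reported accuracy is."""
    return reported_accuracy / chance_accuracy(num_classes)

all_whales_lift = lift_over_chance(0.0761, 78)  # 7.61 percent over 78 classes
subset_lift = lift_over_chance(0.19, 12)        # 19 percent over 12 classes
```

Under this assumption the 78-class result is roughly six times chance and the 12-class result a little over twice chance.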
3.3 Dual Input Merged Model
In an attempt to create a model that took as input as much signal as possible, we constructed a
dual-input merged classification model whose inputs were the cropped original image and the output
of the MRF pre-processing applied to that cropped image. Both inputs were re-sized to 224x224. The
structure was based on two independent VGG16 networks pre-trained on ImageNet. Training was
performed on the final convolution block as well as on an added fully connected layer on top of the
VGG networks.
Figure 8: Merged model structure. Blue indicates layers where learning was turned off, while yellow
layers had learning turned on.
The model was trained using SGD with a slow learning rate of 0.00001 and momentum of 0.9, as
suggested by "Building powerful image classification models using very little data" [Chollet(b)].
While a training accuracy of 99.85% was achieved, this was accompanied by zero testing accuracy
using 20% of the data set as test data. This indicates that the model was too complex for the size
of the data set provided.
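For reference, the classical SGD-with-momentum update keeps a running velocity for each weight and moves the weight by that velocity. The sketch below applies this rule to a toy one-parameter problem using the learning rate and momentum quoted above; the function name and the quadratic objective are illustrative assumptions, not the training code.

```python
def sgd_momentum_step(weights, grads, velocity, lr=0.00001, momentum=0.9):
    """One update: velocity = momentum * velocity - lr * grad; weight += velocity."""
    new_velocity = [momentum * v - lr * g for v, g in zip(velocity, grads)]
    new_weights = [w + v for w, v in zip(weights, new_velocity)]
    return new_weights, new_velocity

# Toy objective f(w) = w^2, whose gradient is 2w.
weights, velocity = [1.0], [0.0]
first_weights, first_velocity = sgd_momentum_step(weights, [2 * weights[0]], velocity)

# A few more steps show how little each update moves the weight.
for _ in range(3):
    weights, velocity = sgd_momentum_step(weights, [2 * weights[0]], velocity)
```

With a learning rate this small each step changes the weight only in the fifth decimal place, which is the intent when fine-tuning pre-trained layers: the existing features are nudged rather than overwritten.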
3.4 Simple Convolutional Neural Network
Because of the inability of the pre-trained VGG network to extract relevant features, and the
massive over-fitting that resulted when learning was applied to only a limited portion of the VGG
network, we attempted to train a simple CNN from scratch. The network consisted of a single 2D
convolution layer, a max pooling layer, a fully connected 64-node hidden layer, dropout of 0.5 and
a classification layer. ReLU was used as the activation function for the hidden layers, while
softmax was used for the classification layer. See Figure 9 for results.
Input    Learning Rate  Momentum  Train Accuracy  Test Accuracy
HMRF     0.1            0.0       0.142           0.0074
HMRF     0.00001        0.5       0.128           0.147
HMRF     0.00001        0.9       0.399           0.0662
Cropped  0.00001        0.9       0.0456          0.441
Figure 9: Results from training using different input data sets and parameters. HMRF input refers
to the HMRF pre-processing applied to the whole data set, while Cropped refers to the manually
cropped data set. All inputs were re-sized to 224x224.
4 Discussion
4.1 Contributions
Michael initiated the project and collaborated with the team from Fisheries and Oceans Canada to
obtain the data set and understand the problem from the scientists' perspective. Luke and Elijah
teamed up to work on the pre-processing half of the project. While everyone met and dealt with
issues together, we each had our own specific jobs. Luke developed a process for cropping images
manually and focused on that for the first half of the project, so that the whale classification
team could test and refine their programs on better cropped images. Once we had a full set of
high-resolution cropped images, Luke continued by assisting where he could with whale classification
and with Elijah's automated cropping tool. Kathleen was responsible for the Minimal Compute Method
for Baseline Results and the Support Vector Machine for Classification, while Michael was
responsible for the Binary Classification using Pre-trained CNN, the Dual Input Merged Model and
the Simple Convolutional Neural Network.
References
[kag()] https://www.kaggle.com/c/noaa-right-whale-recognition.
[Sharma()] Kamal Kishor Sharma. github.com/kamalkishor/Wound_Image_Segmentation_by_Markov_Random_Field.
[Wang(2012)] Quan Wang. HMRF-EM-image: Implementation of the hidden Markov random field
model and its expectation-maximization algorithm, 2012.
[Rosebrock()] Adrian Rosebrock. http://www.pyimagesearch.com/2015/03/09/capturing-mouse-clic.
[Khalil-Hani(2014)] M. Khalil-Hani and L. S. Sung. A Convolutional Neural Network Approach for Face
Verification. NUCLEIC ACIDS RESEARCH, 2014.
[Koch(2015)] Gregory Koch. Siamese neural networks for one-shot image recognition. PhD thesis,
University of Toronto, 2015.
[Chollet(a)] Francois Chollet. deep-learning-models, a. https://github.com/fchollet/deep-learning-models.
[Mori(2016)] Dr. Greg Mori. Assignment 3: Deep learning. 2016.
[Simonyan and Zisserman(2015)] Karen Simonyan and Andrew Zisserman. Very deep convolutional
networks for large-scale image recognition. ICLR, 2015.
[Chollet(b)] Francois Chollet. Building powerful image classification models using very little
data, b. https://blog.keras.io/building-powerful-image-classification-models-using-very-little-data.html.