Master thesis

SORBONNE UNIVERSITY
Internship report
Masters degree in Computer Science majoring Data science DAC
(Données, Apprentissage, Connaissances)
Computer Science Department
Faculty of Science and Engineering
• Psychographic Spatial Segmentation
• Ads Recognition
Author
Abdelraouf
KESKES
Mentor
Pantelis
MAROUDIS
September 1, 2020

Acknowledgement
Within this unusual pandemic circumstances and tragic 2020 year, I would like to express my
deepest thanks to My Family who backs me up to lay this milestone degree in this prestigious
university and be on the point where I am today.
I am very grateful for having a chance to be part of JCDecaux company and especially the
DataCorp division where I met so many wonderful people and professionals who endorsed my
professional development. I also want to express my particular thanks to my mentor Pantelis
MAROUDIS (Data scientist), and my talented colleague Joris TRUONG (Data scien-
tist) for taking part in useful decisions, having fruitful discussions and giving valuable advices
throughout the internship.
Finally, I would express my sincere and special thanks to all my distinguished professors
and teachers in Sorbonne University for their dedication, world class courses, and vivid classes,
and more importantly for transmitting me the discipline of hard work and always seeking the
top to be the best version of myself.

About JCDecaux
JCDecaux Group is a multinational corporation based in Neuilly-sur-Seine, near Paris.It is the
largest outdoor advertising corporation considered as the top outdoor advertising company over
the world,whether in bus stations, airports, urban furniture or even more, with more than 410
million reached people on the planet every day.
My internship took place inside the DataCorp division, a division recently created to leverage
the data flowing within the company whether it is internal, open, or obtained by partnerships
in order to investigate its usefulness in the advertising context and add a strong data-driven
dimension to the business landscape.
DataCorp consists of 25 employee divided into 4 teams :
• Practice (My Team) : It is a pure technical team which is in charge of all the data
pipeline from collecting the raw data to building efficient predictive models.
• Project Management : It is a team that aims to optimize the products and solutions
roadmap to ensure their relevance to the competitive advantage of JCDecaux.
• Partnerships : It is a team in charge of mining and signing meaningful data partnerships
for JCDecaux by working with external data partners.
• Communication : It is a team which raises awareness that the data is a core asset for
JCDecaux and that the Group is data-driven, whether for internal or external audiences.

Abstract
During the internship, I’ve been involved into two projects which are very different and inde-
pendent .However, they are both related to the outdoor advertising context .
The first project is called "PSA" (Psychographic Spatial Segmentation) where we tried to mine
an external partnership data and extract some relevant patterns in order to model and segment
a geographic city map in a way that reflects people’s opinions, habits and tastes (psychograph-
ics). For instance, the result for Paris city could be 1st,2nd, 7th arrondissements have similar
psychographics, they do prefer Heinken Beer, Action Movies, Classical music,the 18th, 13th,
20th are quite similar and do prefer Leffe Beer, Comedy movies, Rap music. We devoted a lot
of time and energy for the project.Unfortunately, since the partnership with the data provider
has stopped due to the unpredictable COVID-19 crisis the contract signing was postponed to
2021 and therefore the data were not available and the project was interrupted.
The second project, was about controlling Ads content, for example we don’t want to display
alcohol advertising in school areas.The problem technically could be formalized as an "image
recognition" or more precisely "a multi-label classification" where we want to detect objects/-
tags in ads images without the necessity to localize them.We build the whole data pipeline from
gathering raw data to predictions .But,According to the business context and time constraints
we aimed to leverage all the available free models, tools, and datasets and do a comparative
study with an amazon paid service called "Amazon Rekognition"

Contents
Acknowledgement 1
About JCDecaux 2
Abstract 3
1 Problem 1 : Psychographic Spatial Segmentation 5
1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2 Data provider : Qloo API . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3 Problem Setting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.4 Literature review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.5 Approaches Shortlist . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2 Problem 2 : Ads Recognition 13
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.2 Problem Setting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.2.1 global problem definition . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.2.2 Classes definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.2.3 Data gathering and labelling . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.2.4 Our process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.3 Datasets and Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.3.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.3.2 Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.3.3 Models x Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.4 Our approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.4.1 Static handcrafted mapping . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.4.2 Dynamic mapping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.5 Metrics and Finetuning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.6 The comparative benchmarking . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.8 Further Improvements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
Bibliography 33

1 Problem 1 : Psychographic Spatial Segmentation
1.1 Introduction
Within an outdoor advertising company, the ads content represents the core chunk its business
industry. One of their ultimate objectives is to maximize the relevance of the ads content in
a geographical area. By aiming so, JCDecaux needs to find out and understand deeply the
inhabitants habits, lifestyles, tastes, opinions, beliefs etc ..., that’s what we call "Psychograph-
ics".These psychographics data are provided by an external data provider called Qloo where we
request their API by giving a geolocation area (a rectangle or a circle) , and the API returns
a list of entities (Movies, Music, Artists, Traveling destinations, ...) with their affinity scores
reflecting the inhabitants opinions and tastes. From this raw data, we intend to divide a city
map into segments/clusters sharing the same preferential and lifestyle patterns (not necessarily
spatially related).
The major difficulties of the problem are :
• It is an unsupervised learning problem by definition, so the optimization/evaluation met-
rics are inner to the the data and purely mathematical.Therefore, we have no guar-
antee that the final finetuned segmentation reflects really at a certain extent
the ground truth of real life, but it could be very fruitful and insightful for marketing
experts within a very complex graph of assets(advertising boards) in a city, and it could
help them to discover new patterns and as a consequence raise new strategies.
• The size of the regions (rectangles or circles) to cluster on while dividing a city, could
lead to a very dissimilar final results
• How to represent these regions ? Embedding ? Since our data is not purely numerical
or quantitative, it has a very important characteristic which is the order/ranking or more
precisely the preferential aspect because when we request Qloo API for a region in a city
it returns a sorted list (by affinity score) of cross-domain entities
• The preferential data returned by Qloo is cross-domain, which means that in a city
and for a specific region we don’t retrieve entities of a specific domain such as Beauty or
Music or Series, we instead retrieve all relevant entities for all available domains.
1.2 Data provider : Qloo API
Qloo is an american company that uses AI to understand taste and cultural correlations. It
provides companies with an API to request their services .Basically, It establishes consumer
preference correlations via machine learning across data spanning cultural domains including
music, film, television, dining, nightlife, fashion, books, and travel.
5

Regarding our application, the API proposes a recommendation system that given a geolocated
region, it returns a list of different entities with their relevance score belonging to several
domains.According to the first agreement the list of the domains/sub-domains that we were
supposed to have is :
• Brands(Automotive, Health Beauty, Fashion, Electronics)
• Films(Movies)
• Music(Artists)
• Travels(Hotels, Destinations)
• TV(Series)
The Qloo API was also endowed with some filters to make the recommendations based on :
• the gender of the population
• the age range
• the domain/sub-domain to get entities from a specific domain such as Movies
These filters could be very appealing to our experimental phase and could be very impactful
on the final result.The following figure depicts exactly what we get as data :
Figure 1: Qloo API data retrieving
In addition, we suppose that Qloo team behind the scenes has done a lot of work from web scrap-
ing raw data (google’s reviews, movies rating and reviews, twitter following, likes, reactions,
etc ...) to build such a qualitative recommender system and obviously the more interactive the
city is the more reliable and trustful the data are. In our scope, we assume that Qloo team
has done an outstanding work and consequently the data is very reliable and reflects people’s
tastes.
6

1.3 Problem Setting
From the raw data returned by Qloo API, we aim to create clusters of regions that share the
same preferential patterns.Thereby, the name of the project stands for :
• Psychographic : Qloo API returned data that is based on people’s tastes, opinions,
lifestyle, Beliefs, in other terms "psychographics".
• Spatial : We are dealing with geographical regions (a spatial information) .
• Segmentation : the goal is to segment a city in the same way as the urban segmentation
by grouping regions into homogeneous segments which shares the same preferences.
Figure 2: PSA problem overview
More formally our data is mainly a matrix
X =



score11 . . . score1M
.
.
.
...
.
.
.
scoreN1 . . . scoreNM



Region×Entities
where :
• N is the number of regions
• M is the number of entities
• 0 6 scoreij 6 1
In addition, we also have some extra data that corresponds to entities features (such as Music
genres, Restaurant Food type, etc ...) return by Qloo API which could be investigated to refine
7

our final clustering and maybe lead to better results .
The latter could be represented as a sparse Matrix as the following :
X0
=



1 0 1 . . . 0
.
.
.
.
.
.
.
.
. . . .
.
.
.
0 0 1 . . . 0



Entities×Charateristics
1.4 Literature review
Since, there is no paper that solves exactly the same problem as us, we decided to spend a lot
of time on this part to cover the maximum number of ideas from the literature that could be
relevant and may lead to a good solution .Our review was based essentially on:
• papers that has the same goal but not the same data, for example geographical urban
clustering based on Point of Interests
• papers that have the same preferential aspect within the data, and aims to cluster them,
but there is no spatial/geo-oriented information, such as clustering user/movies rating
data .
After compiling about 60 research paper, synthesizing them and filtering them, we end up with
a pre-selected list of 15 papers that we organized in a kind of mind map as illustrated in the
following figure :
Figure 3: Literature Review mind map
As you can see, we could basically split the litterature into two families :
1. those who consider the data as purely quantitative ignoring the fact that there is an or-
der/preferential aspect, scores are therefore just numbers and regions as simple geometric
datapoints in Cartesian spaces.
8

In this family we could find well-known clustering algorithms references such as K-
means [1], agglomerative hierarchical clustering [2] [3] [4] [5]and DBSCAN [6] .Also, we
could find some references where we project our raw data into another space by modeling
the problem in a more intelligent way such as LDA Modeling(with DER extension) [7] [8],
or SVM Modeling [9] (see the next section for details)
2. The second family of papers is where they take into consideration the preferential/ranking
aspect with the data, since in our case the Qloo returned entities are sorted by their
relevance score within a geolocated area, and here we have also two branches:
• The first one is model-based where we assume that our heterogeneous data comes
from a mixture of distributions, each distribution is a probabilistic generative model
representing «one cluster» characterized by a central representative order of prefer-
ence/ranking of entities(equivalent to the mean in a Gaussian distribution) and a
variability parameter(equivalent to the standard deviation in a Gaussian distribu-
tion).Here, the mixture is learned using EM algorithm coupled with MCMC tech-
niques and many tricky variations, so the core research work was done basically on
the probabilistic ranking generative model that is supposed to generate an ordered
list such as ISR Model [10], Bayesian Placket-Luce Model [11], Weighted distance-
based generative models where we have different distance metrics : kendal-tau
(Bayesian [12] and determinstic [8] Mallow Model), Spearman, Hamming, Footrule,
etc ...
• The second one contains model-free methods and consequently computationally effi-
cient such as K-o-means EBC [13] where they used Kmeans with Spearman dissim-
ilarity and formalized the central order with Expected Borda Count, CCA [14] and
agglomerative copula clustering [4] where they used max linkage with TOP ranks
dissimilarity based on copula Clayton function
for clarification purposes, we would like to highlight an important point regarding the
ranking generative models cited above.So, when we talk about Bayesian Models, the
term "bayesian" refers to non determinstic models when a distribution including some
extra information about uncertainty is given for each learnable parameter rather than a
point-estimate value ,
1.5 Approaches Shortlist
We have fixed a primary list of 4 approaches to absolutely experiment
1. The classical approach : when we consider our data as any quantitative data ignoring
the order aspect, trying all the classical algorithms cited above such as DBSscan, Kmeans
and agglomerative clustering.
Our core work should be on two aspects :
9

• the similarity/distance metrics : L1, L2, linkage, Jaccard distance, ...
• the regions embeddings : Feature selection, Feature handcrafted engineering, Di-
mension reduction (PCA, TSNE, ...), for instance a Regions × Entities_Features
Matrix with entities (counting|TF-IDF) weighted by the affinity score .
• LDA Modeling : Where we could consider the Regions as Documents, Words as
Entities features(for example music genres) and Topics as our Clusters
Figure 4: LDA Modeling
• SVM Modeling :where for each region we learn an SVM, thus we learn a weights
vector w that represents the region preferences, and therefore each region will be
represented by its SVM w vector, as the following :
Figure 5: SVM Modeling
where :
– xi
is an entity
– when the entity x1
is preferred to the entity x2
in a region (have a higher affinity
score), we label x1
− x2
as +1 and reversely for −1
Afterwards, we construct a cosine similarity matrix Regions × Regions
cos(W1, W2) =
W1 ∗ W2
||W1|| ∗ ||W2||
10

Finally, we perform the iterative Dubnov clustering algorithm based on L∞ norm
and the Jensen-Shannon divergence until convergence.
the advantages of this approach are the fact that it is a multistage approach (freezing
embeddings),the regions similarity Matrix of size Regions × Regions which is very
suitable for computations especially using Dubnov Clustering
• Mixture of ISR Models: as explained before, it consider all the datapoints as
resulting from a mixture of generative models where each model is an Insertion
Sorting Rank (ISR) Model, the latter assume that a datapoint results from a sorting
algorithm based on paired comparisons, characterized with a central ranking, and a
dispersion parameter.
The advantages of this model are :
– it has an extensive experiments on a very similar use case which was basically
"the Clustering of the European countries according to their votes at the Euro-
vision contest between 2007 and 2012" (see Figure 6)
– it takes into consideration partial rankings (in our case an entity that does not
appear in all regions)
– It is multivariate, where the entry of the algorithm is a 3D tensor (Countries,
Vote Candidates, Years) which is very appealing to our case (Regions, Entities,
Entities Features)
– It uses a sophisticated and computationally efficient Algorithm called «SEM-
Gibbs» rather than a straightforward EM Algorithm.
11

Figure 6: European countries clustering according to their votes at the Eurovision contest
between 2007 and 2012 with ISR Models
1.6 Conclusion
Unfortunately, the unpredictable COVID-19 crisis has postponed Qloo partnership contract
signing because of the economical collapse caused by the crisis that touches the global world
and especially JCDecaux where their incomes are mostly from outdoor people’s activities
which is extremely limited with the quarantine.
But, we do believe that we did a great literature review that covers almost all what could be
relevant to the final objective.It will certainly help and facilitate the task in the future once
they decide to resume it
12

2 Problem 2 : Ads Recognition
2.1 Introduction
Within an outdoor advertising company, the ads content represents the core chunk of the its
industry, whether from an economical perspective, an ethical perspective or even more.
Many outdoor media owners are subject to many rules and restrictions that ban them from
placing certain ads, for example those that are sexually suggestive, or those that promote
age-restricted products such as alcohol, gambling or e-cigarettes, within school areas.Another
striking example is to display snack junk food around healthcare and hospitals areas, beside
sensitive diseases departments.
By assessing whether an ad has been responsibly placed or not, the authorities have started
seriously taking into consideration these kind of restrictions and want to take action and penalize
severely those who do not respect them.Therefore, marketers have to consider these restrictions
and ensure that the ads are not targeting an inappropriate audience .
Even though, the advertising campaigns are carefully planned by experts, the necessity to assist
the supervision with a machine learning layer is indisputable for two important reasons :
• the assets(advertising boards) graph is very complex and even experts could make mis-
takes by allowing involuntary some restricted content to be exposed to the wrong audience
• the displaying process is mostly automated and programmed according to the advertising
campaigns calendar, and as we all know, computer systems can bug and display intolerable
content .Hence, the necessity to investigate what could be relevant to our use case ,in order
to endow the system with cutting edge models to assist experts and ease their tasks by
getting notifications where some content is susceptible to be misplaced.
the following figure depicts some ads that are forbidden to be displayed for kids around schools
:
Figure 7: ads images showing some restrictive content for kids (Alcohol, Sexual, Snack)
13

It is important to understand that the school example was just to illustrate explicitly the
problem from a real life perspective, the restriction around the content could be around various
categories and not only Alcohol or Snack Food.
2.2 Problem Setting
The previous section has given you a real life perspective to the problem, and in this section,
we will formalize the problem from a technical perspective. Additionally, after the following
section you should have a clear vision and understanding about What exactly we want to do
(our objective) ?
As a practical use case and after interviewing the corresponding content experts, they have
transmitted to us a table containing all the categories that they want to detect, and consequently
as a starting point we realized that all what they want to detect are tangible objects inside
ads images .Therefore, ads representing concepts such as "Sexuality", "Smoking cessation",
"lottery games", etc ... are actually out of our scope and at least for the first version of the
system .
Figure 8: A portion of the raw list of categories
14

2.2.1 global problem definition
As described previously, we want to extract the content of an ad image and detect some relevant
categories, the following figure illustrates exactly the task :
Figure 9: a global view of the problem
First, we don’t have a dataset so we will build it from scratch based on the targeted categories.
Second, since we don’t aim to localize the objects on the image, it is very important to mention
that technically the task is not an Object Detection task, but rather an Image Recognition
or more precisely a Multi-Label Classification task.Thus, an image that contains one hundred
Vodka bottles or an image that contains one vodka shot are similar to us and we want just to
get the label "Vodka" or more abstractly "Alcohol" .
Here raises a very decisive question which will have a big impact on the data gathering, labeling
process, and obviously the performances and which is the following : "Which Level of ab-
stract categories we want to detect exactly ?" the following figure contains two objects
and illustrates a striking example about this question
Figure 10: an example illustrating the annotation abstraction questioning
15

2.2.2 Classes definition
Before answering the previous question, we need first to define the global labels abstraction
hierarchy.So, after prepossessing the blue table (see Figure 8), filtering out unnecessary cate-
gories, merging some categories (For example "Bourbon" and "Whiskey" into "Whiskey"), we
devised carefully the following hierarchy :
16

Figure 11: Classes hierarchy
Globally, as you can see we have 3 level of semantic abstractions.
• Level 0: it is very generic and almost meaningless to take decisions about the ads content
and contains basically 4 classes : Food, Drink, Confectionery and Medical
• Level 1: It is adequate to take decisions about the content and covers classes like :
Alcohol, Dairy Product, Dessert, Soft drink ...
• Leaf Level: It is very specific and could make the experimental phase very rich and
extensive
to make the experimental step more interesting and fruitful, We decided to annotate the data
on the leaf level, despite the fact that the level 1 annotation is sufficient to our problem, but
with a simple mapping we could switch from the specific to the generic .For instance, an image
labeled Coffee we could easily change the annotation to Soft Drink to Drink in the code.
Therefore, we end up with 59 granular categories which are the following : ’Cereal bar’, ’Fruit
bar’, ’Chewing gum’, ’Sugar candy’, ’Chocolate’, ’Medical’, ’Coffee’, ’Juice’, ’Carbonated drink’,
’Energy drink’, ’Alcoholic cocktail’, ’Beer’, ’Stout’, ’Cider’, ’Liqueur’, ’Brandy’, ’Gin’, ’Vodka’,
’Port’, ’Rum’, ’Sherry’, Whiskey’, ’Wine’, ’Champagne’, ’Cava’, ’Vermouth’, ’Cooler’, ’Cake’,
’Pastry’, ’Pie’, ’Yoghurt’, ’Custard’, ’Cream’, ’Cheese’, ’Fromage frais’, ’Ice lolly’, ’Ice cream’,
’Butter’, ’Cooking oil’, ’Bacon’, ’Sausage’, ’Cooking sauce and condiment’, ’Pizza’, ’Quiche’,
’Bread’, ’Biscuit’, ’Cracker’, ’Savoury food spread’, ’Snack’, ’Nut’, ’Crisp’, ’Honey’, ’Syrup’,
’Jam’, ’Sugar’, ’Artificial sweetener’, ’Soup’, and ’Meal’ .
2.2.3 Data gathering and labelling
The sources of the data were internal (JCDecaux Orphea API) and external(Web scraping)
while gathering the data was very time consuming especially for some categories which are
17

hard to feed and where ads images are almost unavailable such as : Quiche, ’Savoury food
spread’, ’Fruit bar’, ’Artificial sweetener’ ...
During the collection process we tried to ensure to build a balanced dataset within all abstract
levels but mostly on leaf classes
Figure 12: The collected data distribution
Additionally, we fixed a goal of 10-20 images/class and the reason why this number will be
explicitly detailed is the benchmarking process (see the next section "Our process")
Also, the annotation tool was carefully selected,because we realized that most of available
tools are not suitable for our case, they are specific to some advanced tasks such as «Object
Detection», «Segmentation», etc ... and we ended up using a very simple tool found on github
called LabelClass that we customized to make the annotation process faster.
Figure 13: The adapted labeling tool
18

2.2.4 Our process
Since, the company has access to Amazon services, there is a tool called "Amazon Rekog-
nition(AR)" which is generic and trained on thousands categories with millions of images
.Therefore, we don’t aim to create a model, train it, . . . from scratch to compare with AR.But,
we were asked to :
• leverage what’s available as pretrained models, architectures, similar and relevant datasets
.
• benchmark the available pretrained models on our custom small dataset and compare
their performances with "AR".
In case, we have better performances from what’s available and open sourced, the company
could cancel this service subscription .
2.3 Datasets and Models
We started investigating which datasets and models that could be pertinent and relevant to
our classes and our goal
2.3.1 Datasets
We a did a research about the most generic open datasets that covers several categories which
could be close to ours and we summed up them in the following table :
Figure 14: Datasets
We realized that most of the available datasets does not match exactly our expected categories
since for example PASCAL VOC and COCO are very generic and others are very specific
(for Example KITTI is for autonomous cars).We found one interesting dataset which is Open
images with more than 600 categories for object detection, and 6000 classes for Multi-Label
Classification, the latter was very appealing to our expected classes .
19

2.3.2 Models
Then, we a did a research about the corresponding models, it is important to understand
that both pretrained object detection models and pretrained multi-label classifica-
tion models could help us in our benchmarking task, because both of them returns at
least the categories found on an image with their confidence scores
Figure 15: Models
a brief intuitive explanation for each appealing model is given in the following list :
• Yolo [15] : It is a robust one-stage real-time object detection model backboned by a
feature extractor called "Darknet".Yolo explores the convolution principle to pass the
whole image to the network in one pass, it learns to detect interesting regions through
regression, and to identify objects through classification with high confidence.Additionally,
many tricks and ideas are included on the process to refine the results such as NMS
supression or others.
Figure 16: Yolo concept
20

• Faster RCNN [16] : It is a two-stages object detection model, the first step focused
on extracting region proposals with RPN network, the second step focuses on classifying
these regions of interests, there is also a upstream part through RoI Pooling.
Figure 17: Faster RCNN concept
• RetinaNet [17] : It is a one-stage object detection model which uses the Spatial Pyrami-
dal feature extraction and whose novelty was essentially in the Focal Loss (cross-entropy
v 2.0) which was introduced to solve the unbalanced data issue such that it down-weights
the loss assigned to well-classified examples throughout learning.
Figure 18: RetinaNet concept
• Resnet101 : A convolutional neural network of 101 layers, backboned by Resnet blocks
(cf figure 14), the latter is conceived to ensure the theoretical hypothesis which states to
"the deeper we go the downer the loss decreases" and avoid the U curve that contradicts
the learning theory fundamentals.The intuition behind it is that the neural network should
21

learn from x whether to map F(x) or skip the block and map the identity x in case we
overcomplexify the model and add unnecessary layers
Figure 19: Residual learning block
• Inception3 : It is a convolutional neural net, where the v3 is an improvement of the initial
Version Inception1/GoogLeNet and Inception2 by focusing on factorization (reducing the
number of connections/parameters without decreasing the efficiency), it is built essentially
with Inception Blocks as shown in the figure 15
Figure 20: Inception3 architecture
22

2.3.3 Models x Datasets
We did a combination matrix to have a clearer vision and we sought all the models that are
available for Open images dataset whether for 600 categories object detection or 6000 categories
multi-label classification, and it turns out that the list of the relevant models to compare with
Amazon Rekognition contains mainly 5 models (3 objected detectors and 2 multi-label classifiers
described previously). Intuitively, We had a lot of hope on the multi-label classifiers since the
domain of categories is larger and covers the vaste majority of our granular categories .
Figure 21: Models x Datasets
2.4 Our approach
A major problem of our process is that the output categories whether from Amazon Rekognition
(thousands categories) or the Open Images Pretrained models (600/6000 categories) does not
contain all our expected categories and sometimes they are available in different terms like
instead of "Carbonated Drink" you could find "Soda", "Coca Cola", "Coke" ...
Traditionally, we do use "Transfer Learning" to switch from one set of categories to another,
but since :
• It is out of our scope by definition : The decision makers want us to benchmark
what’s straightforwardly available without training a new model etc ...
• Our dataset contains 600 images : it is a tiny dataset and could easily lead to a
drastic overfitting even with transfer learning and regularization techniques, and also due
23

to its very small size between training, validating and testing we could not have a trustful
and reliable performances evaluation
Thus, we decided to map straightly the returned categories to ours as the following :
After investigating some tricks we set up two approaches :
2.4.1 Static handcrafted mapping
where we passed through all the outputs from all models and we built a mapping dictionary as
the following :
Figure 22: Static mapping
This approach has only 1 hyper parameter (the confidence score threshold)
24

2.4.2 Dynamic mapping
where we proposed an NLP approach, based on distance
Figure 23: Dynamic NLP mapping
The distance formula that we chose is :
d(labelpred, labelour) = α ∗ hd(labelpred, labelour) + (1 − α) ∗ cd(labelpred, labelour)
where :
• hd(x, y) : is the hierarchical distance between two words x and y, more precisely the
WordNet Wu-Palmer Similarity (based on depth + most common specific ancestor).The
utility of this distance is shown on the following examples :
hd(”Lager”, ”Alcohol”) = 0.9
hd(”Water”, ”Alcohol”) = 0.4
hd(”Coffeemaker”, ”Coffee”) = 0.5
• cd(x, y) : is the contextual cosine similarity between the two Word2vec-GloVe embeddings
of x and y.The utility of this distance is shown on the following examples(when sometimes
an image contain context related object to a certain category but very far hierarchically):
hd(”Coffeemaker”, ”Coffee”) = 0.85
hd(”Water”, ”Alcohol”) = 0.59
hd(”Beerglass”, ”Alcohol”) = 0.7
Note that this approach has two hyper parameters which are the confidence score threshold
to filter out classes and also α to adjust the importance weight between the hierarchical distance
and contextual distance.
In addition, all the next reported results, and figures are from the first approach (the hand-
crafted one), because the second approach was not very promising from the very first experi-
ments and it seems to require a lot of time and computational power to refine it.
25

2.5 Metrics and Finetuning
• Technical Metrics: they aim to optimize and evaluate, so for each category we have :
1. Recall : Among all expected images for a specific class C how many images we
detected correctly Recall = TP
TP+FN
, intuitively it reflects at a certain extent how
much false negatives we could tolerate
2. Precision : Among all what we detected as class C how many were really belonging
to this specific class,Precision = TP
TP+FP
intuitively it reflects at a certain extent how
much false detection we could tolerate.
3. F1-score : the harmonic mean between Precision and Recall F1 = 2 ∗ Precision×Recall
Precision+Recall
An illustrative example : Let’s suppose that we have a list of 10 images where the
1,3,4,10 images contains «Pizza» and our model predicted that the images 1,3,7,8,10
contains «Pizza» .The Metrics for the «Pizza» category are :
Precision = 3
5
= 0.6 Recall = 3
4
= 0.75 F1 = 0.67
Figure 24: "Amazon Rekognition" evaluation metrics for different confidence score thresholds
within Pizza, Chocolate and Honey classes
the figure above was for illustrating, however for optimizing the threshold score we
need a metric that sumed up the results for all categories
Averaging Metrics : there are different methods of averaging in order to optimize
the confidence score threshold such as : macro, weighted, micro, samples, and even
more.Despite the fact that macro and weighted averaging was the most interesting
to us but since all averaging methods leads to almost the same best threshold value
so we did not struggle to get the best value for each model (see Figure 25)
• Human readable/ Communication Metrics After selecting the best threshold for a
model, to expose and communicate its performances to non technical persons :
– Since the recall is more interesting to us than the precision because we do prefer
to do some additional false detections (for instance detecting coca cola as Alcohol)
rather than skipping an ad that should be banned (For example Whiskey image in
26

Figure 25: The 4 averaging metrics to optimize the confidence score threshold for Amazon
Rekognition
a school area) . we reformulate it as the "good detections rate" for each categories
(see Figure 26 left side)
– we proposed a sample-based metric where we count the number of images where the
model has detected (see Figure 26 right side) :
∗ all the objects available on the image correctly,
∗ at least one object correctly, or missed all the
∗ zero content on the image (completely missed it)
Figure 26: Human readable / Communication Metrics for "Amazon Rekognition"
27

2.6 The comparative benchmarking
As explained previously, we have annotated our data in the leaf classes level in order to be able
to switch the annotations to a more abstract level quickly and compare models performances
within all abstraction levels.
• Leaf level "59 class" : As discussed and explained previously we chose Recall for results
communication because it is more appropriate to us after selecting the best confidence
score threshold . When you see the figure below you will realize that as expected when we
label our images in a very specific granular way, "Amazon Rekognition" has a huge gap
over other models and bypasses them by far whether in terms of number of detectable
categories where he is able to detect 41 classes among 59, or in performances (Pizza, Beer
and Gin have a recall score of 1), and obviously we justify this by the fact that AR was
trained on a huge and diverse dataset covering a wide range of categories and certainly
carefully finetuned since it is not free. Also, we could notice that :
Figure 27: Comparing the Recall between different models at the leaf annotation level
– Even if we combine and bag all models together, it will not beat Amazon Rekognition
because most of them a centralized on a certain set of categories
– the most competitive model at this level is The 6000-classes Resenet101 multi-label
classifier which could be justified by the number of categories
– Yolo is missing completely the predictions, and we could justify that (even for
the next comparative figures) by guessing that either the given official pretrained
weights for 600 categories Open-images were completely messy or the author has
just published a starting/warming weights but did not train it to the fullest (requires
several weeks or months)
28

• Level 1 "30 class" :
Figure 28: Recall, Precision, F1-score between different models for different Level 1 classes
That’s the most interesting level, and as we can see the models are more robust and com-
petitive even though it shows a slight advantage for Amazon, for example the Alcohol
F1-score for Amazon is 0.75 and for both Resnet101 and Inception3 is 0.68 We also could
notice that for the category "Snack" which is very wide, grouping a lot of objects such
as Crisps, Nuts, Sandwich, Junk Food, Sugar candies, Biscuit etc ... Amazon Rekogni-
tion has been bypassed by 3 models Faster RCNN, Resnet101 and Inception3, we could
justify this by the fact that the snack class is a very generic class and consequently very
ambiguous and for this reason Amazon did not handle it as a specific class
• Level 1 "4 classes" : Not very relevant as a level to annotate on, since it is very generic
: Food, Drink, Confectionery and Medical(grouping essentially medication ads images).
but worth experimenting.
When you see the figure below (Figure 29),we notice that as expected the models became
extremely competitive with a very slight advantage to Amazon Rekognition such as Food
class F1-score, for AR it is 0.9 and for Resnet101 it is 0.89.
29

Figure 29: Recall, Precision, F1-score between different models for different Level 0 classes
2.7 Conclusion
«Amazon Rekognition» is globally better and dominates all the other models within all annota-
tion levels.However, the more abstract the annotations are, the more accurate and competitive
the models are, and sometimes(very rarely) AR could be beaten in some generic categories like
"Snack" or "Meal".
To endorse and illustrate this conclusion,we have made the following figure
Figure 30: Number of images where the different models detect all, at least one, and zero
instances correctly
30

2.8 Further Improvements
The proposed improvements are between 2 axes :
• Quick solution : to continue in the same line as the previous work around quick bench-
marking solutions for business, we realized that what we have actually done is to extract
an ad’s content only from a visual perspective (using visual models whether they are de-
tectors or classifiers).Whereas, in most of advertising content we have a relevant textual
information and we have also a logo.Thus, the proposed improvement should first of all
endorse the visual perspective with a "logo recognition" part, and then add a textual
perspective while extracting content from and ad with a "Character/word recognition"
layer ( See figure 31)
Figure 31: Improving the Benchmarking solution
• Time consuming but carefully devised solution : Even though this proposition
was for us primary and indisputable, but after a long debate with decision makers, they
rejected it and they privileged the straightforward benchmarking way and consider it as
shifting the project’s goal which is fundamentally to do the comparative benchmarking
study .
So, the proposed solution is to :
1. focus on building a richer qualitative dataset
2. Start with a classical Transfer learning paradigm since the source and destination
outputs domains are different, and the source and the destination inputs are from
two different distributions (Open images natural images) vs (ads images colorful
posters with text, art etc...)
3. explore some very advanced techniques such as Few|One|Zero shot learning
31

4. If we want to keep the list of categories to detect as "Open" to the real world
and explore the possibility to add some new categories gradually, we could investi-
gate the CVPR amazing work of Open Long-Tailed Recognition (OLTR) where they
proposed an OLTR model to tackle simultaneously long tailed recognition (imbal-
anced classifition + few shot learning) and novelty detection. The model was based
on learning dynamic features from the classical ones generated by any backbone.
To solve imbalanced dateset they showed that class-aware-sampling during classifier
training gave the best results. On the other hand few shot learning problem was
solved by knwoledge transfer from head to tail classes via learning memory centroids
for all classes. Finally, open recognition was approached by optimizing the triplet loss
to force separable clusters of each class with a normalized distance in the close-set
space, then open classes detection is done by threshing over softmax output.
32

Bibliography
[1] M. Ahmed, M. T. Imtiaz, and R. Khan, “Movie recommendation system using clustering
and pattern recognition network,” in 2018 IEEE 8th Annual Computing and Communica-
tion Workshop and Conference (CCWC), pp. 143–147, IEEE, 2018.
[2] E. Brentari, L. Dancelli, and M. Manisera, “Clustering ranking data in market segmenta-
tion: a case study on the Italian McDonald’s customers’ preferences,” Journal of Applied
Statistics, vol. 43, no. 11, pp. 1959–1976, 2016.
[3] E. Ntoutsi, K. Stefanidis, K. Nørvaag, and H.-P. Kriegel, “Fast group recommendations by
applying user clustering,” in International conference on conceptual modeling, pp. 126–140,
Springer, 2012.
[4] A. Bonanomi, M. N. Ruscone, and S. A. Osmetti, “Defining subjects distance in hierarchical
cluster analysis by copula approach,” Quality & Quantity, vol. 51, no. 2, pp. 859–872, 2017.
[5] D. Müllensiefen, C. Hennig, and H. Howells, “Using clustering of rankings to explain brand
preferences with personality and socio-demographic variables,” Journal of Applied Statis-
tics, vol. 45, no. 6, pp. 1009–1029, 2018.
[6] B. Li, Y. Liao, and Z. Qin, “Precomputed clustering for movie recommendation system in
real time,” Journal of Applied Mathematics, vol. 2014, 2014.
[7] J. Yuan, Y. Zheng, and X. Xie, “Discovering regions of different functions in a city us-
ing human mobility and POIs,” in Proceedings of the 18th ACM SIGKDD international
conference on Knowledge discovery and data mining, pp. 186–194, 2012.
[8] X. Zhang, W. Li, F. Zhang, R. Liu, and Z. Du, “Identifying urban functional zones using
public bicycle rental records and point-of-interest data,” ISPRS International Journal of
Geo-Information, vol. 7, no. 12, p. 459, 2018.
[9] J. Díez, J. J. Del Coz, O. Luaces, and A. Bahamonde, “Clustering people according to
their preference criteria,” Expert Systems with Applications, vol. 34, no. 2, pp. 1274–1284,
2008.
[10] J. Jacques and C. Biernacki, “Model-based clustering for multivariate partial ranking data,”
Journal of Statistical Planning and Inference, vol. 149, pp. 201–217, 2014.
[11] C. Mollica and L. Tardella, “Bayesian Plackett–Luce mixture models for partially ranked
data,” Psychometrika, vol. 82, no. 2, pp. 442–458, 2017.
33

[12] V. Vitelli, Sørensen, M. Crispino, A. Frigessi, and E. Arjas, “Probabilistic preference
learning with the Mallows rank model,” The Journal of Machine Learning Research, vol. 18,
no. 1, pp. 5796–5844, 2017.
[13] T. Kamishima and S. Akaho, “Efficient clustering for orders,” in Mining complex data,
pp. 261–279, Springer, 2009.
[14] A. D’Ambrosio and W. J. Heiser, “A distribution-free soft-clustering method for preference
rankings,” Behaviormetrika, vol. 46, no. 2, pp. 333–351, 2019.
[15] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-
time object detection,” in Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), June 2016.
[16] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection
with region proposal networks,” in Advances in neural information processing systems,
pp. 91–99, 2015.
[17] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss for dense object detec-
tion,” in Proceedings of the IEEE international conference on computer vision, pp. 2980–
2988, 2017.
34

Master thesis

Recommended

Recommended

More Related Content

Similar to Master thesis

Similar to Master thesis (20)

More from Raouf KESKES

More from Raouf KESKES (7)

Recently uploaded

Recently uploaded (20)

Master thesis