Automatic image moderation in classifieds

By Jaroslaw Szymczak
PYDATA PARIS @ PYPARIS 2017
June 12, 2017

Agenda
● Image moderation problem
● Brief sketch of approach
● Machine learning foundations of the solution:
○ Image features
○ Listing features (and combination of both)
● Class imbalance problem:
○ proper training
○ proper testing
○ proper evaluation
● Going live with the product:
○ consistent development and production environments
○ batch model creation
○ live application
○ performance monitoring

Scale of business at OLX
● 4.4 app rating
● #1 app in +22 countries (1)
● people spend more than twice as long in OLX apps versus competitors
● became one of the top 3 classifieds apps in the US less than a year after its launch
● 130 countries
● +60 million monthly listings
● +18 million monthly sellers
● +52 million cars are listed every year on our platforms; 77% of the total number of cars manufactured!
● +160,000 properties are listed daily
● listed at OLX every second:
○ 2 houses
○ 2 cars
○ 3 fashion items
○ 2.5 mobile phones
(1) Google Play store; shopping/lifestyle categories
Note: excludes Letgo. Associates at proportionate share

✔ real photo of the phones
✔ selfie with a dress
✔ real shoes photo
✘ human on the picture (OLX India)
✘ stock photo (OLX Poland)
✘ contact details, e.g. "CALL 555-555-555" (all sites)
✘ NSFW (all sites)

Binary image classification
Image features:
● CNN fine tuning
● transfer learning
● image represented as 1D vector
Classic features:
● category of listing
● is the listing from a business or a private person?
● what is the price?
All fed to
Why not more, e.g. title, description, user history?
Because of pragmatism, we don’t want to overcomplicate the model:
● CNNs are the state of the art for image recognition
● classic features help to improve accuracy, but too many of them would reduce the significance of the image features

Classic image features
And many others, more or less sophisticated methods of feature extraction...

Convolutional Neural Networks
Source: lecture notes to Stanford Course CS231n: http://cs231n.stanford.edu/slides/2017

Fine tuning and transfer learning
Source: lecture notes to Stanford Course CS231n: http://cs231n.stanford.edu/slides/2017

Inception network
Source: http://redcatlabs.com/2016-07-30_FifthElephant-DeepLearning-Workshop/
Inception 21k
Trained on 21 841 classes of the ImageNet set
Top-1 accuracy above 37%
Available for mxnet:
https://github.com/dmlc/mxnet-model-gallery/blob/master/imagenet-21k-inception.md

VGG16 network
Source: https://www.cs.toronto.edu/~frossard/post/vgg16/
● used model from Keras
● easy to freeze arbitrary layers (layer.trainable = False )

Listing features
With eXtreme Gradient Boosting (XGBoost)

Feature preparation
After encoding, the “classic features” are concatenated with the image features
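A minimal sketch of this concatenation step, using only the standard library. The category vocabulary, the function names and the encoding choices (one-hot category, 0/1 business flag, log-scaled price) are illustrative assumptions and are not taken from the talk.

```python
import math

# Hypothetical category vocabulary; the real one depends on the site.
CATEGORIES = ["electronics", "fashion", "cars", "real_estate"]


def encode_classic_features(category, is_business, price):
    """Encode the listing-level features as a flat list of floats."""
    one_hot = [1.0 if category == c else 0.0 for c in CATEGORIES]
    business_flag = 1.0 if is_business else 0.0
    # Log scaling of the price is an assumption made for this sketch.
    log_price = math.log1p(max(price, 0.0))
    return one_hot + [business_flag, log_price]


def build_feature_vector(image_vector, category, is_business, price):
    """Concatenate the 1D image representation with the encoded classic features."""
    return list(image_vector) + encode_classic_features(category, is_business, price)


# A toy 4-dimensional "CNN output"; real vectors are far longer.
image_vector = [0.12, 0.0, 0.87, 0.33]
row = build_feature_vector(image_vector, "fashion", False, 0.0)
```

Each listing then becomes one row of a single feature matrix that a tabular model can consume.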

Gradient boosting?
● instead of updating example weights in each round, you fit the weak learner to the residuals (or pseudo-residuals) of the current model
● similarly to the learning rate in neural networks, a shrinkage parameter scales down each update of the model
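The two points above can be illustrated with a toy gradient boosting loop. This is a sketch under stated assumptions, not XGBoost itself: it uses squared-error loss (where the pseudo-residuals are simply the target minus the current prediction), one-split decision stumps on a single feature as weak learners, and hypothetical function names.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split that best fits the residuals (least squares)."""
    best = None
    for threshold in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        if not left or not right:
            continue
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def gradient_boost(xs, ys, n_rounds=50, shrinkage=0.1):
    """Return a predictor built from n_rounds shrunken stumps."""
    base = sum(ys) / len(ys)  # initial constant model
    learners = []

    def predict(x):
        return base + shrinkage * sum(h(x) for h in learners)

    for _ in range(n_rounds):
        # Pseudo-residuals for squared error: target minus current prediction.
        residuals = [y - predict(x) for x, y in zip(xs, ys)]
        learners.append(fit_stump(xs, residuals))
    return predict


def mse(predict, xs, ys):
    return sum((y - predict(x)) ** 2 for x, y in zip(xs, ys)) / len(ys)


xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 0.9, 3.1, 3.0, 2.9]
mean_y = sum(ys) / len(ys)
baseline_mse = mse(lambda x: mean_y, xs, ys)
boosted_mse = mse(gradient_boost(xs, ys), xs, ys)
```

A smaller `shrinkage` makes each round correct less of the remaining error, so more rounds are needed to reach the same training fit.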

eXtreme Gradient Boosting (XGBoost)
Source:
https://www.slideshare.net/JaroslawSzymczak1/xgboost-the-algorithm-that-wins-every-competition

Class imbalance - proper training
● possibilities to deal with the problem:
○ undersampling majority class
○ oversampling minority class:
■ randomly
■ by creating artificial examples (SMOTE)
○ reweighting
● undersampling suits our needs the most
○ the general population of good images is not much “hurt” by undersampling
○ given training data size limitations, we can train on more unique examples of bad images
○ we undersample so that the ratio changes from 99:1 to 9:1
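Random undersampling to a 9:1 ratio can be sketched as follows. The function name, the label convention (1 = bad image, 0 = good image) and the toy data are assumptions made for illustration.

```python
import random


def undersample_majority(examples, labels, target_ratio=9, seed=0):
    """Randomly keep at most `target_ratio` majority examples (label 0)
    per minority example (label 1); all minority examples are kept."""
    rng = random.Random(seed)
    positives = [i for i, y in enumerate(labels) if y == 1]
    negatives = [i for i, y in enumerate(labels) if y == 0]
    n_keep = min(len(negatives), target_ratio * len(positives))
    kept = sorted(positives + rng.sample(negatives, n_keep))
    return [examples[i] for i in kept], [labels[i] for i in kept]


# Toy data with a 99:1 ratio: 990 good images, 10 bad ones.
labels = [1] * 10 + [0] * 990
examples = list(range(1000))
train_x, train_y = undersample_majority(examples, labels)
```

Only the training set is resampled this way; every bad image stays in, which is what lets a size-limited training set hold more unique bad examples.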

Class imbalance - proper testing
● use the real-life class ratio in the test set
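One way to see why the test set should keep the real-life ratio is to compute the precision that the same classifier would show under different shares of bad images. The recall and false positive rate used below (0.80 and 0.05) are made-up illustrative values, and the function name is hypothetical.

```python
def expected_precision(recall, false_positive_rate, positive_share):
    """Precision implied by recall, false positive rate and the share of positives.

    precision = TPR * p / (TPR * p + FPR * (1 - p))
    """
    true_pos = recall * positive_share
    false_pos = false_positive_rate * (1.0 - positive_share)
    return true_pos / (true_pos + false_pos)


# Same classifier, evaluated at the undersampled and at the real-life ratio.
precision_9_to_1 = expected_precision(0.80, 0.05, 0.10)
precision_99_to_1 = expected_precision(0.80, 0.05, 0.01)
```

With these numbers the precision is 0.64 at a 9:1 ratio but only about 0.14 at 99:1, so a test set with the training ratio would look far better than production.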

Class imbalance - proper evaluation
● accuracy is a useless measure in such a case
● sensible measures are:
○ ROC AUC
○ PR AUC
○ Precision @ fixed Recall
○ Recall @ fixed Precision
● ROC AUC:
○ can be interpreted as concordance probability (i.e. a random positive example gets a higher score than a random negative one with probability equal to the AUC)
○ it is, however, too abstract to use as a standalone quality metric
○ does not depend on the class ratio
● PR AUC
○ Depends on data balance
○ Is not intuitively interpretable
● Precision @ fixed Recall, Recall @ fixed Precision:
○ they heavily depend on data balance
○ they are the best to reflect the business requirements
○ and to take into account processing capabilities (then actually Precision @k is more accurate)
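The measures above can be written out directly from their definitions. This is a plain-Python sketch for small examples, with hypothetical function names and toy data; it assumes at least one positive and one negative example, and distinct scores for the threshold-based measures.

```python
def roc_auc(scores, labels):
    """Concordance probability: share of (positive, negative) pairs in which
    the positive example has the higher score; ties count as half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    concordant = sum(1.0 if p > n else 0.5 if p == n else 0.0
                     for p in pos for n in neg)
    return concordant / (len(pos) * len(neg))


def precision_at_recall(scores, labels, min_recall):
    """Precision at the highest threshold whose recall reaches min_recall."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    total_pos = sum(labels)
    true_pos = 0
    for k, (_, y) in enumerate(ranked, start=1):
        true_pos += y
        if true_pos / total_pos >= min_recall:
            return true_pos / k
    return 0.0


def precision_at_k(scores, labels, k):
    """Share of positives among the k highest-scored examples."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    return sum(y for _, y in ranked[:k]) / k


scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 0, 1, 0, 0, 1, 0, 0]
auc = roc_auc(scores, labels)
p_at_r = precision_at_recall(scores, labels, min_recall=0.6)
p_at_2 = precision_at_k(scores, labels, k=2)
```

Precision @k fits the processing-capacity view: `k` is the number of images moderators can review, and the metric says what share of that queue is actually bad.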

ROC AUC - inception-21k and vgg16

Consistent development and production environments
● ensure you have the drivers installed
    nvidia-smi
● create docker image
    FROM nvidia/cuda:8.0-cudnn5-devel-ubuntu16.04
    ...
    ENV BUILD_OPTS "USE_CUDA=1 USE_CUDA_PATH=/usr/local/cuda USE_CUDNN=1"
    RUN cd /home && git clone https://github.com/dmlc/mxnet.git mxnet \
        --recursive --branch v0.10.0 --depth 1 \
        && cd mxnet && make -j$(nproc) $BUILD_OPTS
    ...
    RUN pip3 install tensorflow==1.1.0
    RUN pip3 install tensorflow-gpu==1.1.0
    RUN pip3 install keras==2.0
● use nvidia-docker-compose wrapper
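The driver check from the first bullet can also be run from Python, for example at the start of a batch job. This is a small sketch using only the standard library; the function name is hypothetical, and it only tests that `nvidia-smi` is on the PATH and exits successfully.

```python
import shutil
import subprocess


def gpu_driver_available(timeout_seconds=10):
    """Return True if `nvidia-smi` is found on PATH and exits with code 0."""
    executable = shutil.which("nvidia-smi")
    if executable is None:
        return False
    try:
        result = subprocess.run(
            [executable],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_seconds,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


driver_ok = gpu_driver_available()
```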

Batch process
with use of Luigi framework
● re-usability of processing
● fully automated pipeline
● containerized with Docker

Luigi tips
● create your output at the very end of the task
● you can dynamically create dependencies by yielding the task
● adding the workers parameter to your command parallelizes tasks that are ready to be run (e.g. python run.py Task … --workers 15)
● for straightforward workflows inheritance comes in handy:
    import luigi


    class SimpleDependencyTask(luigi.Task):

        def create_simple_dependency(self, predecessor_task_class,
                                     additional_parameters_dict=None):
            # Pass on the parameters of this task that the predecessor also declares.
            if additional_parameters_dict is None:
                additional_parameters_dict = {}
            result_dict = {k: v for k, v in self.__dict__.items()
                           if k in predecessor_task_class.get_param_names()}
            # Explicitly given parameters override the inherited ones.
            result_dict.update(additional_parameters_dict)
            return predecessor_task_class(**result_dict)

    # Fragment from inside a task's run() method: a dynamic dependency created by yielding a task
    ads_from_one_day = yield DownloadAdsFromOneDay(self.site_code,
                                                   effective_current_date)
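The first tip matters because Luigi treats a task as complete once its output exists, so a half-written file left behind by a crash would make a failed task look finished. A minimal standard-library sketch of the idea is to write to a temporary file and move it into place only at the very end; the function name is hypothetical and the sketch deliberately does not use Luigi's own target classes.

```python
import os
import tempfile


def write_output_atomically(path, lines):
    """Write lines to a temporary file and rename it to `path` only on success."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            for line in lines:
                handle.write(line + "\n")
        # The final output appears in one step, after all the work is done.
        os.replace(tmp_path, path)
    except BaseException:
        # Leave no partial output behind if anything failed.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

The temporary file is created in the same directory as the final output so that the rename does not cross filesystems.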

Live process
with use of Flask
● hosted in AWS
● horizontally scaled
● containerized with Docker

Performance monitoring (with Grafana)

Acknowledgements
● Vaibhav Singh
● Jaydeep De
● Andrzej Prałat