2. Agenda
● Image moderation problem
● Brief sketch of approach
● Machine learning foundations of the solution:
○ Image features
○ Listing features (and combination of both)
● Class imbalance problem:
○ proper training
○ proper testing
○ proper evaluation
● Going live with the product:
○ consistent development and production environments
○ batch model creation
○ live application
○ performance monitoring
4. Scale of business at OLX
● 4.4 app rating
● #1 app in +22 countries (1)
● people spend more than twice as long in OLX apps as in competitors' apps
● became one of the top 3 classifieds apps in the US less than a year after its launch
● 130 countries
● +60 million monthly listings
● +18 million monthly sellers
● +52 million cars are listed every year on our platforms; 77% of the total number of cars manufactured!
● +160,000 properties are listed daily
● listed at OLX every second:
○ 2 houses
○ 2 cars
○ 3 fashion items
○ 2.5 mobile phones
(1) Google Play Store; shopping/lifestyle categories
Note: excludes Letgo. Associates at proportionate share
5. ✔ real photo of the phones
✔ selfie with a dress
✔ real shoes photo
✘ human on the picture (OLX India)
✘ stock photo (OLX Poland)
✘ contact details, e.g. "CALL 555-555-555" (all sites)
✘ NSFW (all sites)
7. Binary image classification
Image features:
● CNN fine tuning
● transfer learning
● image represented as 1D vector
Classic features:
● category of listing
● is the listing from a business or a private person?
● what is the price?
All fed to
Why not more, e.g. title, description, user history?
For pragmatic reasons: we don't want to overcomplicate the model:
● CNNs are state of the art for image recognition
● classic features help improve accuracy, but too many of them would reduce the significance of the image features
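The combination of image and listing features can be sketched as a simple concatenation. This is only an illustration: the category vocabulary, the function names and the 4-dimensional stand-in for the CNN vector are all assumptions, not details from the slides.

```python
# Hypothetical category vocabulary; the real list is not given in the slides.
CATEGORIES = ["phones", "fashion", "cars"]


def listing_features(category, is_business, price):
    """Encode the classic listing features as a flat list of floats."""
    one_hot = [1.0 if category == c else 0.0 for c in CATEGORIES]
    return one_hot + [1.0 if is_business else 0.0, float(price)]


def combine(image_vector, category, is_business, price):
    """Concatenate the 1D image vector with the listing features."""
    return list(image_vector) + listing_features(category, is_business, price)


# Stand-in for a CNN embedding (real ones are much longer).
image_vector = [0.12, 0.80, 0.05, 0.33]
row = combine(image_vector, "fashion", is_business=False, price=49.0)
print(len(row), row)
```

The resulting row is what a downstream classifier would receive; keeping the listing part short leaves most of the input dimensions to the image.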
17. Gradient boosting?
● instead of updating example weights in each round, you fit the weak learner to the residuals or pseudo-residuals
● as with the learning rate in neural networks, a shrinkage parameter scales each update to the model
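A minimal sketch of the idea, assuming squared loss (where the pseudo-residuals are simply y minus the current prediction) and one-split decision stumps on a single feature as weak learners. The function names and the toy data are illustrative only.

```python
def fit_stump(xs, residuals):
    """Find the threshold split on a 1D feature that minimises squared error."""
    best = None
    for t in sorted(set(xs))[:-1]:  # skip the max so the right side is never empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm


def gradient_boost(xs, ys, rounds=50, shrinkage=0.1):
    """Each round fits a stump to the current residuals and adds a shrunken copy."""
    base = sum(ys) / len(ys)
    learners = []
    preds = [base] * len(xs)
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, preds)]  # pseudo-residuals for squared loss
        stump = fit_stump(xs, residuals)
        learners.append(stump)
        preds = [p + shrinkage * stump(x) for p, x in zip(preds, xs)]
    return lambda x: base + shrinkage * sum(s(x) for s in learners)


xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
model = gradient_boost(xs, ys)
print(round(model(2), 3), round(model(7), 3))
```

A smaller shrinkage value needs more rounds to reach the same fit, which is the usual trade-off when tuning it.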
20. Class imbalance - proper training
● possibilities to deal with the problem:
○ undersampling majority class
○ oversampling minority class:
■ randomly
■ by creating artificial examples (SMOTE)
○ reweighting
● undersampling suits our needs the most
○ the general population of good images is not much “hurt” by undersampling
○ with a limited training data size, we can train on more unique examples of bad images
○ we undersample so that the class ratio changes from 99:1 to 9:1
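Random undersampling from 99:1 to 9:1 can be sketched as follows. The function name, the label convention (1 = bad image, 0 = good image) and the synthetic data are assumptions made for the example.

```python
import random


def undersample(examples, labels, target_ratio=9, seed=0):
    """Keep every positive and a random subset of negatives,
    so that there are at most target_ratio negatives per positive."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    keep_neg = rng.sample(neg, min(len(neg), target_ratio * len(pos)))
    keep = sorted(pos + keep_neg)
    return [examples[i] for i in keep], [labels[i] for i in keep]


# Synthetic 99:1 data set: 10 bad images, 990 good ones.
labels = [1] * 10 + [0] * 990
examples = list(range(1000))
x_small, y_small = undersample(examples, labels)
print(sum(y_small), len(y_small) - sum(y_small))
```

All minority examples survive, so a fixed training budget is spent on as many distinct bad images as possible.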
22. Class imbalance - proper evaluation
● accuracy is a useless measure in such a case
● sensible measures are:
○ ROC AUC
○ PR AUC
○ Precision @ fixed Recall
○ Recall @ fixed Precision
● ROC AUC:
○ can be interpreted as concordance probability (i.e. AUC is the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one)
○ it is, however, too abstract to use as a standalone quality metric
○ does not depend on classes ratio
● PR AUC
○ Depends on data balance
○ Is not intuitively interpretable
● Precision @ fixed Recall, Recall @ fixed Precision:
○ they heavily depend on data balance
○ they are the best to reflect the business requirements
○ they can also take processing capacity into account (in that case Precision @ k is actually more accurate)
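The difference between these measures can be checked on a toy example. The sketch below computes ROC AUC directly as the concordance probability and Precision @ fixed Recall by scanning the ranked list; the function names and the scores are made up for illustration, and the ranking scan assumes no tied scores between positives and negatives.

```python
def roc_auc(scores, labels):
    """Fraction of (positive, negative) pairs in which the positive scores higher;
    ties count as half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def precision_at_recall(scores, labels, min_recall):
    """Best precision over all cut-offs whose recall is at least min_recall."""
    total_pos = sum(labels)
    ranked = sorted(zip(scores, labels), reverse=True)
    best, tp = 0.0, 0
    for k, (_, y) in enumerate(ranked, start=1):
        tp += y
        if tp / total_pos >= min_recall:
            best = max(best, tp / k)
    return best


scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 0, 1, 0, 0, 0, 0, 0]

# Same score distribution, but ten times as many negatives.
neg_scores = [s for s, y in zip(scores, labels) if y == 0]
scores_10x = scores + neg_scores * 9
labels_10x = labels + [0] * (len(neg_scores) * 9)

print(roc_auc(scores, labels), roc_auc(scores_10x, labels_10x))
print(precision_at_recall(scores, labels, 1.0),
      precision_at_recall(scores_10x, labels_10x, 1.0))
```

With ten times more negatives the AUC stays the same while the precision at full recall drops sharply, which is why the precision and recall based measures have to be evaluated at the class ratio seen in production.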
27. Consistent development and production environments
● ensure you have the drivers installed
nvidia-smi
● create docker image
FROM nvidia/cuda:8.0-cudnn5-devel-ubuntu16.04
...
ENV BUILD_OPTS "USE_CUDA=1 USE_CUDA_PATH=/usr/local/cuda USE_CUDNN=1"
RUN cd /home && git clone https://github.com/dmlc/mxnet.git mxnet \
    --recursive --branch v0.10.0 --depth 1 \
    && cd mxnet && make -j$(nproc) $BUILD_OPTS
...
RUN pip3 install tensorflow==1.1.0
RUN pip3 install tensorflow-gpu==1.1.0
RUN pip3 install keras==2.0
● use nvidia-docker-compose wrapper
28. Batch process
with use of Luigi framework
● re-usability of processing
● fully automated pipeline
● containerized with Docker
32. Luigi tips
● create your output at the very end of the task
● you can dynamically create dependencies by yielding the task
● adding the workers parameter to your command parallelizes tasks that are ready to be run (e.g. python run.py Task … --workers 15)
● for straightforward workflows inheritance comes in handy:
import luigi

class SimpleDependencyTask(luigi.Task):
    def create_simple_dependency(self, predecessor_task_class,
                                 additional_parameters_dict=None):
        # Pass on every parameter of this task that the predecessor also declares.
        if additional_parameters_dict is None:
            additional_parameters_dict = {}
        result_dict = {k: v for k, v in self.__dict__.items()
                       if k in predecessor_task_class.get_param_names()}
        result_dict.update(additional_parameters_dict)
        return predecessor_task_class(**result_dict)

# Dynamic dependency, yielded from inside a task's run() method:
ads_from_one_day = yield DownloadAdsFromOneDay(self.site_code,
                                               effective_current_date)
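The first tip matters because Luigi by default treats a task as complete once its output exists, so a half-written file left behind by a crash would look like a finished task. One way to follow the tip, sketched here in plain Python without Luigi, is to write to a temporary file and rename it into place as the last step. The helper name and the file contents are hypothetical.

```python
import os
import tempfile


def write_output_atomically(path, lines):
    """Write to a temporary file in the target directory, then rename it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as handle:
            for line in lines:
                handle.write(line + "\n")
        os.replace(tmp_path, path)  # the output appears only once it is complete
    except BaseException:
        os.remove(tmp_path)  # leave no partial file behind
        raise


out_dir = tempfile.mkdtemp()
target = os.path.join(out_dir, "ads.tsv")
write_output_atomically(target, ["id\ttitle", "1\tred dress"])
with open(target) as handle:
    print(handle.read())
```

If the write fails midway, the target path never exists, so a rerun of the pipeline would execute the task again instead of skipping it.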
33. Live process
with use of Flask
● hosted in AWS
● horizontally scaled
● containerized with Docker