Automatic image moderation in classifieds
By Jaroslaw Szymczak
PYDATA PARIS @ PYPARIS 2017
June 12, 2017
Agenda
● Image moderation problem
● Brief sketch of approach
● Machine learning foundations of the solution:
○ Image features
○ Listing features (and combination of both)
● Class imbalance problem:
○ proper training
○ proper testing
○ proper evaluation
● Going live with the product:
○ consistent development and production environments
○ batch model creation
○ live application
○ performance monitoring
Image moderation problem
Scale of business at OLX
● 4.4 app rating
● #1 app in +22 countries (1)
● People spend more than twice as long in OLX apps versus competitors
● became one of the top 3 classifieds apps in the US less than a year after its launch
● 130 countries
● +60 million monthly listings
● +18 million monthly sellers
● +52 million cars are listed every year on our platforms; 77% of the total number of cars manufactured!
● +160,000 properties are listed daily
● Listed at OLX every second:
○ 2 houses
○ 2 cars
○ 3 fashion items
○ 2.5 mobile phones
(1) Google Play Store; shopping/lifestyle categories
Note: excludes Letgo. Associates at proportionate share
✔ real photo of the phones
✔ selfie with a dress
✔ real shoes photo
✘ person in the picture (OLX India)
✘ stock photo (OLX Poland)
✘ contact details, e.g. “CALL 555-555-555” (all sites)
✘ NSFW (all sites)
Brief sketch of approach
Binary image classification
Image features:
● CNN fine tuning
● transfer learning
● image represented as 1D vector
Classic features:
● category of listing
● is the listing from a business or a private person?
● what is the price?
All fed to
Why not more, e.g. title, description, user history?
Pragmatism: we don’t want to overcomplicate the model:
● CNNs are the state of the art for image recognition
● classic features help improve accuracy, but too many of them would reduce the significance of the image features
Image features
Classic image features
And many others, more or less sophisticated methods of feature extraction...
Convolutional Neural Networks
Source: lecture notes to Stanford Course CS231n: http://cs231n.stanford.edu/slides/2017
Fine tuning and transfer learning
Source: lecture notes to Stanford Course CS231n: http://cs231n.stanford.edu/slides/2017
Inception network
Source: http://redcatlabs.com/2016-07-30_FifthElephant-DeepLearning-Workshop/
Inception 21k
Trained on 21,841 ImageNet classes
Top-1 accuracy above 37%
Available for mxnet:
https://github.com/dmlc/mxnet-model-gallery/blob/master/imagenet-21k-inception.md
VGG16 network
Source: https://www.cs.toronto.edu/~frossard/post/vgg16/
● used the model from Keras
● easy to freeze arbitrary layers (layer.trainable = False)
Listing features
With eXtreme Gradient Boosting (XGBoost)
Feature preparation
After encoding, the “classic features” are concatenated with the image features
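A minimal sketch of that concatenation in plain Python. The category vocabulary, the log scaling of the price, and the helper names `encode_listing` and `combine` are illustrative assumptions, not the encoding used in the talk; the short `image_vector` stands in for the 1D vector produced by the CNN.

```python
import math

# Hypothetical category vocabulary; the real one is site-specific.
CATEGORIES = ["cars", "fashion", "phones", "real_estate"]


def encode_listing(category, is_business, price):
    """Encode the classic listing features as a flat list of floats."""
    one_hot = [1.0 if category == c else 0.0 for c in CATEGORIES]
    business_flag = [1.0 if is_business else 0.0]
    # log1p tames the long tail of prices (an assumption, not from the slides).
    log_price = [math.log1p(max(price, 0.0))]
    return one_hot + business_flag + log_price


def combine(image_vector, listing_vector):
    """Concatenate the 1D image representation with the listing features."""
    return list(image_vector) + list(listing_vector)


image_vector = [0.12, 0.0, 0.87, 0.45]  # stand-in for a CNN embedding
row = combine(image_vector, encode_listing("phones", False, 199.0))
print(len(row), row)
```

Each listing becomes one flat row, which is the input shape a tree-based learner such as XGBoost expects.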
Adaptive Boosting
Gradient boosting?
● instead of updating example weights in each round, you fit the weak learner to the residuals or pseudo-residuals
● as with the learning rate in neural networks, a shrinkage parameter scales each update, so every new weak learner corrects only part of the remaining loss
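The two points above can be sketched in plain Python for the squared-error loss, where the pseudo-residuals are simply `y - F(x)`. The one-split stump learner, the toy data and all function names are illustrative assumptions; this is not XGBoost's implementation, which adds regularization and second-order information.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression stump minimising squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def gradient_boost(xs, ys, rounds=50, shrinkage=0.1):
    """Boost stumps on pseudo-residuals; for squared loss these are y - F(x)."""
    base = sum(ys) / len(ys)  # initial constant model
    stumps = []

    def predict(x):
        # Every stump's contribution is scaled down by the shrinkage factor.
        return base + shrinkage * sum(stump(x) for stump in stumps)

    for _ in range(rounds):
        residuals = [y - predict(x) for x, y in zip(xs, ys)]
        stumps.append(fit_stump(xs, residuals))
    return predict


def mse(predict, xs, ys):
    return sum((y - predict(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]
mean_y = sum(ys) / len(ys)
mse_before = mse(lambda x: mean_y, xs, ys)
model = gradient_boost(xs, ys)
mse_after = mse(model, xs, ys)
print(mse_before, mse_after)
```

A smaller `shrinkage` needs more rounds to reach the same training error, but each round commits less to any single weak learner.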
eXtreme Gradient Boosting (XGBoost)
Source: https://www.slideshare.net/JaroslawSzymczak1/xgboost-the-algorithm-that-wins-every-competition
Class imbalance problem
Class imbalance - proper training
● possibilities to deal with the problem:
○ undersampling majority class
○ oversampling minority class:
■ randomly
■ by creating artificial examples (SMOTE)
○ reweighting
● undersampling suits our needs the most
○ the general population of good images is not “hurt” much by undersampling
○ given training data size limits, we can train on more unique examples of bad images
○ we undersample so that the ratio changes from 99:1 to 9:1
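A minimal sketch of such undersampling, assuming label 1 marks a bad image and label 0 a good one. The function name `undersample`, the fixed seed and the integer stand-ins for images are illustrative assumptions.

```python
import random


def undersample(examples, labels, target_ratio=9, seed=0):
    """Keep every bad (label 1) example; sample good ones down to target_ratio:1."""
    rng = random.Random(seed)
    bad = [x for x, y in zip(examples, labels) if y == 1]
    good = [x for x, y in zip(examples, labels) if y == 0]
    n_good = min(len(good), target_ratio * len(bad))
    kept_good = rng.sample(good, n_good)
    data = [(x, 0) for x in kept_good] + [(x, 1) for x in bad]
    rng.shuffle(data)
    return data


# Toy data at the 99:1 ratio from the slide: 9,900 good images, 100 bad ones.
labels = [0] * 9900 + [1] * 100
examples = list(range(len(labels)))  # stand-ins for image identifiers
train = undersample(examples, labels)
n_bad = sum(y for _, y in train)
print(len(train) - n_bad, n_bad)  # 900 100
```

All bad images are kept, so a fixed training budget is spent on as many unique minority examples as possible.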
Class imbalance - proper testing
● use the real-life ratio
Class imbalance - proper evaluation
● accuracy is a useless measure in such a case
● sensible measures are:
○ ROC AUC
○ PR AUC
○ Precision @ fixed Recall
○ Recall @ fixed Precision
● ROC AUC:
○ can be interpreted as a concordance probability (i.e. a random positive example scores higher than a random negative example with probability equal to the AUC)
○ it is, though, too abstract to use as a standalone quality metric
○ does not depend on the class ratio
● PR AUC:
○ depends on data balance
○ is not intuitively interpretable
● Precision @ fixed Recall, Recall @ fixed Precision:
○ they heavily depend on data balance
○ they best reflect the business requirements
○ they can take processing capacity into account (in that case Precision @ k is actually more accurate)
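The dependence on data balance can be checked with a small pure-Python sketch. It computes ROC AUC directly as the concordance probability and the best precision at a minimum recall, then repeats both after replicating the negatives tenfold. The scores, labels and function names are made up for illustration.

```python
def roc_auc(scores, labels):
    """ROC AUC as concordance probability; tied pairs count as half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    concordant = sum(1.0 if p > n else 0.5 if p == n else 0.0
                     for p in pos for n in neg)
    return concordant / (len(pos) * len(neg))


def precision_at_recall(scores, labels, min_recall):
    """Best precision among score thresholds whose recall is at least min_recall."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    total_pos = sum(labels)
    best, true_pos = 0.0, 0
    for k, (score, y) in enumerate(ranked, start=1):
        true_pos += y
        if k < len(ranked) and ranked[k][0] == score:
            continue  # a threshold cannot split tied scores
        if true_pos / total_pos >= min_recall:
            best = max(best, true_pos / k)
    return best


pos_scores = [0.9, 0.6, 0.4]
neg_scores = [0.8, 0.5, 0.3, 0.2, 0.1]

# Balanced-ish sample: 3 positives, 5 negatives.
scores = pos_scores + neg_scores
labels = [1] * len(pos_scores) + [0] * len(neg_scores)

# Same score distributions, but ten times as many negatives.
scores_rare = pos_scores + neg_scores * 10
labels_rare = [1] * len(pos_scores) + [0] * (len(neg_scores) * 10)

print(roc_auc(scores, labels), roc_auc(scores_rare, labels_rare))
print(precision_at_recall(scores, labels, 0.6),
      precision_at_recall(scores_rare, labels_rare, 0.6))
```

ROC AUC stays at 0.8 in both samples, while precision at recall of at least 0.6 drops from about 0.67 to about 0.17, which is why the test set should keep the real-life ratio.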
ROC AUC - inception-21k and vgg16
PR AUC - inception-21k
PR AUC - vgg16
Going live with the product
Consistent development and production environments
● ensure you have the drivers installed
    nvidia-smi
● create a Docker image
    FROM nvidia/cuda:8.0-cudnn5-devel-ubuntu16.04
    ...
    ENV BUILD_OPTS "USE_CUDA=1 USE_CUDA_PATH=/usr/local/cuda USE_CUDNN=1"
    RUN cd /home && git clone https://github.com/dmlc/mxnet.git mxnet \
        --recursive --branch v0.10.0 --depth 1 \
        && cd mxnet && make -j$(nproc) $BUILD_OPTS
    ...
    RUN pip3 install tensorflow==1.1.0
    RUN pip3 install tensorflow-gpu==1.1.0
    RUN pip3 install keras==2.0
● use the nvidia-docker-compose wrapper
Batch process
with use of the Luigi framework
● re-usability of processing
● fully automated pipeline
● containerized with Docker
Luigi Task
Luigi Dashboard
Luigi Task Visualizer
Luigi tips
● create your output at the very end of the task
● you can dynamically create dependencies by yielding the task
● adding the workers parameter to your command runs tasks that are ready in parallel (e.g. python run.py Task … --workers 15)
● for straightforward workflows inheritance comes in handy:
    import luigi


    class SimpleDependencyTask(luigi.Task):
        def create_simple_dependency(self, predecessor_task_class,
                                     additional_parameters_dict=None):
            # Pass on every parameter of this task that the predecessor also declares.
            if additional_parameters_dict is None:
                additional_parameters_dict = {}
            result_dict = {k: v for k, v in self.__dict__.items()
                           if k in predecessor_task_class.get_param_names()}
            result_dict.update(additional_parameters_dict)
            return predecessor_task_class(**result_dict)

    # dynamic dependency, yielded from within a task's run()
    ads_from_one_day = yield DownloadAdsFromOneDay(self.site_code,
                                                   effective_current_date)
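The first tip matters because Luigi, by default, treats a task as complete once its output exists, so a half-written file left by a crashed run would be mistaken for a finished result. A standard-library sketch of the idea follows: write to a temporary file and move it into place as the very last step. The helper name `write_output_atomically` and the file names are hypothetical and not part of Luigi.

```python
import os
import tempfile


def write_output_atomically(path, lines):
    """Write to a temporary file, then move it into place as the last step."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            for line in lines:
                handle.write(line + "\n")
        # The final output appears only now; the move is atomic on one filesystem.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the partial file so nothing resembling an output is left behind.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


with tempfile.TemporaryDirectory() as workdir:
    target = os.path.join(workdir, "ads_from_one_day.csv")  # hypothetical output
    write_output_atomically(target, ["id,label", "1,0", "2,1"])
    with open(target) as handle:
        written = handle.read().splitlines()
    leftovers = [name for name in os.listdir(workdir) if name.endswith(".tmp")]

print(written, leftovers)
```

If the loop over `lines` fails, the target path is never created, so a rerun of the pipeline recomputes the task instead of skipping it.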
Live process
with use of Flask
● hosted in AWS
● horizontally scaled
● containerized with Docker
Live service architecture
Performance monitoring
Performance monitoring (with Grafana)
Acknowledgements
● Vaibhav Singh
● Jaydeep De
● Andrzej Prałat
