Abstract of the Presentation:
When Dat Tran started his data science career in 2013, everyone was into big data. In fact, big data was at the peak of inflated expectations (according to Gartner). You had to use tools like Hadoop and Spark to be one of the cool kids. Many data prophets told you that data is the new oil, or even gold. In 2018, things haven't changed. Data is still cool and going strong. It's eating the world, and yes, you still need big data, and now also deep, deep, very deep learning. There's a lot of bullshit bingo out there.
In this talk, Dat Tran wants to demystify the buzz in machine learning by presenting simple guidelines for successful data projects and real, practical use cases. He will also share use cases from idealo, Germany's largest price comparison service. And yes, it involves deep learning, and yes, it can get quite technical at times.
About the Author:
Dat Tran is currently co-heading the data team at idealo.de, where he leads a team of Data Scientists and Data Engineers. His aim is to turn idealo into a machine learning powerhouse. His research interests are diverse, from traditional machine learning to deep learning. Previously, he worked for Pivotal Labs and Accenture. He is a regular speaker and has presented at PyData and Cloud Foundry Summit. He also blogs about his work on Medium. His background is in Operations Research and Econometrics. Dat received his MSc in Economics from Humboldt University of Berlin.
DN18 | Demystifying the Buzz in Machine Learning! (This Time for Real) | Dat Tran | Idealo
1. Dat Tran - Head of Data Science
Dat Tran (Head of Data)
@datitran
Demystifying the Buzz in Machine Learning! (This Time for Real)
23/11/2018 - Data Natives 2018 Berlin
#idealoTech
3. What do we do at idealo? Some examples...
● Hotel image ranking for both aesthetic and technical quality
● Low-to-high resolution
● Recommendation engine
4. Check us out! #idealoTech
https://github.com/idealo
https://medium.com/idealo-tech-blog
15. Problem Statement
● For over 50% of the lead-outs, we don't know whether users bought anything or not
● We know it for Amazon & eBay, but with a two-day lag; other problems include direct vs. indirect sales
● Predicting sales is valuable, for example for CRM, the recommendation engine and many other use cases
16. Supervised Learning
Training samples (features → label):
price: 80, pis: 5, ... → sale
price: 5, pis: 1, ... → non-sale
price: 17, pis: 3, ... → sale

ML model training → Predictions:
price: 99, pis: 8, ... → non-sale
price: 65, pis: 2, ... → sale (82%)
price: 32, pis: 9, ... → sale (30%)
price: 40, pis: 5, ... → sale (50%)
price: 20, pis: 2, ... → sale (71%)
Deep Learning????
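The slide above sketches supervised learning on lead-out features. A minimal illustration of the same idea, using made-up numbers and scikit-learn's logistic regression (the talk does not say which model idealo actually uses):

```python
# Hypothetical sketch: features (price, pis) and labels mirror the slide's
# toy example; the data and model choice are illustrative only.
from sklearn.linear_model import LogisticRegression

# Training samples: [price, pis] -> sale (1) / non-sale (0)
X_train = [[80, 5], [5, 1], [17, 3], [60, 4], [10, 1], [25, 6]]
y_train = [1, 0, 1, 1, 0, 1]

model = LogisticRegression()
model.fit(X_train, y_train)

# Predict a sale probability for unseen lead-outs
X_new = [[99, 8], [65, 2], [32, 9]]
probs = model.predict_proba(X_new)[:, 1]  # probability of class "sale"
for features, p in zip(X_new, probs):
    print(f"price: {features[0]}, pis: {features[1]} -> sale ({p:.0%})")
```

The probabilities, not just the hard labels, are what make this useful downstream (e.g. for CRM targeting).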
21. Problem Statement
● 2,306,658 accommodations
● 308,519,299 images
● ~133 images per accommodation
Humans?
Deep Learning??
22. How to start a Deep Learning project
1. Computer Vision: build on pre-trained ImageNet models (e.g. AlexNet)
2. NLP: pre-trained language models (still immature)
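The point of this slide is to start computer-vision projects from models pre-trained on ImageNet rather than from scratch: freeze the backbone, train only a small head. A toy sketch of that pattern, with a fixed random projection standing in for the real pre-trained network (all data and names here are invented for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stand-in for a frozen, ImageNet-pre-trained backbone: a fixed mapping
# from raw inputs to feature vectors (in practice: MobileNet/VGG features).
W_frozen = rng.normal(size=(64, 16))

def extract_features(images):
    # images: (n, 64) flattened pixels -> (n, 16) "embeddings"
    return np.tanh(images @ W_frozen)

# Tiny synthetic "dataset"
images = rng.normal(size=(100, 64))
labels = (images[:, 0] > 0).astype(int)

# Only the small head is trained; the backbone stays frozen.
head = LogisticRegression().fit(extract_features(images), labels)
print("train accuracy:", head.score(extract_features(images), labels))
```

With a real backbone the head often needs only hundreds of labeled images, which is exactly the regime of the second iteration described later in the deck.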
23. Automate Image Quality Assessment
To automate image quality assessment, we trained two models:
● an aesthetic model → predicts the aesthetic score of an image
● a technical model → predicts the technical image quality (distortion, blur, etc.)
We followed the Google paper "NIMA: Neural Image Assessment", published 09/2017.
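NIMA does not predict a single number: it predicts a distribution over the scores 1-10, summarises it by its mean, and is trained with an Earth Mover's Distance loss. A small NumPy sketch of those two ingredients, using invented distributions (not real model output):

```python
import numpy as np

scores = np.arange(1, 11)  # NIMA rates images on a 1..10 scale

def mean_score(p):
    # NIMA's quality score: expected value of the predicted distribution
    return float(np.sum(scores * p))

def emd(p, q, r=2):
    # Earth Mover's Distance between two score distributions, as in the
    # NIMA paper (r=2 gives the squared-EMD training loss).
    cdf_diff = np.cumsum(p) - np.cumsum(q)
    return float((np.mean(np.abs(cdf_diff) ** r)) ** (1.0 / r))

# Illustrative distributions over scores 1..10
pred = np.array([0.0, 0.0, 0.05, 0.1, 0.2, 0.3, 0.2, 0.1, 0.05, 0.0])
true = np.array([0.0, 0.0, 0.0, 0.05, 0.15, 0.35, 0.25, 0.15, 0.05, 0.0])

print("predicted mean score:", round(mean_score(pred), 2))
print("EMD loss:", round(emd(pred, true), 4))
```

EMD penalizes predictions by how far probability mass has to move, so predicting 4 when the truth is 8 costs more than predicting 7, which is why it suits ordered score scales better than plain cross-entropy.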
24. Results - First Iteration
Aesthetic model - MobileNet
Linear correlation coefficient (LCC): 0.5987
Spearman's rank correlation coefficient (SRCC): 0.6072
Earth Mover's Distance (EMD): 0.2018
Accuracy (threshold at 5): 0.74
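These metrics can be computed with scipy; a sketch with invented predicted-vs-true score pairs (not idealo's data) showing how LCC, SRCC and the thresholded accuracy are obtained:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Hypothetical predicted vs. ground-truth mean aesthetic scores
y_true = np.array([5.2, 6.8, 4.1, 7.5, 5.9, 3.3, 6.1, 4.8])
y_pred = np.array([5.0, 6.2, 4.5, 7.1, 5.5, 3.9, 6.4, 4.6])

lcc, _ = pearsonr(y_true, y_pred)    # linear correlation coefficient
srcc, _ = spearmanr(y_true, y_pred)  # Spearman's rank correlation

# Accuracy with the slide's threshold at 5: binarize into good/bad images
acc = np.mean((y_true > 5) == (y_pred > 5))

print(f"LCC: {lcc:.4f}, SRCC: {srcc:.4f}, accuracy: {acc:.2f}")
```

SRCC only cares about ranking, which matters here because the product is a *ranking* of hotel images, not the absolute scores.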
26. Learnings
● First results were not good, but we only learned that because we released early
○ More domain-specific data is needed
● We could load-test our application, which is very valuable
○ Used MobileNet instead of VGG-16
27. Second Iteration
● We built a simple labeling application
● ~12 people from idealo Reise and Data Science labeled
○ 1,000 hotel images for aesthetics
○ 3,000 hotel images for technical quality
● We fine-tuned the aesthetic model with 800 training images
● Built an aesthetic test dataset with the remaining 200 images
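The 800/200 split of the 1,000 aesthetic labels can be done with a standard stratified train/test split; the file names and labels below are placeholders, not the real dataset:

```python
from sklearn.model_selection import train_test_split

# Hypothetical list of the 1,000 labeled hotel-image IDs and scores
image_ids = [f"hotel_img_{i}.jpg" for i in range(1000)]
labels = [i % 10 for i in range(1000)]  # placeholder aesthetic labels

# 800 images for fine-tuning, 200 held out as the test set; stratifying
# keeps the label distribution similar in both splits
train_ids, test_ids, train_y, test_y = train_test_split(
    image_ids, labels, test_size=200, random_state=42, stratify=labels
)
print(len(train_ids), len(test_ids))
```

Fixing `random_state` keeps the split reproducible, which the summary slide calls out as a requirement.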
35. This is our tech stack... only an extract ;)
PyData · Deep Learning · Big Data · Computer Vision · NLP · Production Machine Learning · Visualization · Data Preparation
42. Learnings
● Data changes constantly, so monitor your model performance on a regular basis
● A re-training pipeline is also important
● Don't do it manually; use appropriate tools for this, e.g. Apache Airflow
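A minimal sketch of the monitoring idea above: track model accuracy over a sliding window of recent predictions and flag when re-training is due. The window size and threshold are invented; in practice such a check would run as a scheduled job (e.g. an Apache Airflow task), not inline like this:

```python
from collections import deque

class ModelMonitor:
    """Flags re-training when recent accuracy drops below a threshold."""

    def __init__(self, window=100, threshold=0.7):
        self.window = deque(maxlen=window)  # rolling record of hits/misses
        self.threshold = threshold

    def record(self, prediction, actual):
        self.window.append(prediction == actual)

    def needs_retraining(self):
        if not self.window:
            return False
        accuracy = sum(self.window) / len(self.window)
        return accuracy < self.threshold

monitor = ModelMonitor(window=10, threshold=0.7)
# Feed in (prediction, actual) pairs as ground truth trickles in
for pred, actual in [(1, 1), (0, 1), (1, 1), (0, 0), (1, 0),
                     (0, 1), (1, 1), (0, 0), (1, 0), (0, 1)]:
    monitor.record(pred, actual)
print("re-train?", monitor.needs_retraining())
```

The same check could be wrapped in an Airflow sensor/operator so that a drop in accuracy automatically triggers the re-training DAG.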
44. Data Science Product Life Cycle
Problem Definition → Data Review → API Design → Feature Engineering → Modeling → Evaluation → Operationalization → Feedback → (back to Problem Definition)
45. Learnings
● Use git
● Dockerize (aka containerize) everything
● Use conda and/or pip for package management
● Automate pipeline management (testing, data)
● TDD & API-first strategy (everything as a microservice)
● Don't use Jupyter notebooks for production systems
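One hypothetical way the "Dockerize everything" and microservice bullets combine: a minimal image per model service. File names, the Python version and the port below are assumptions for illustration, not idealo's actual setup:

```dockerfile
# Minimal sketch of a containerized model microservice
FROM python:3.6-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
# The model is exposed behind an HTTP API (API-first, test-driven)
EXPOSE 8080
CMD ["python", "app.py"]
```

One service per model keeps deployments independent, which is what makes the load testing mentioned earlier practical.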
49. Summary
1. Think simple first and then, if it’s really needed, get more complex
2. Define your data product MVP and release as early as possible
3. Creating data products is a team sport
4. Use the right tool for the right problem
5. Use the cloud
6. Measure your model and improve it from time to time
7. Your results need to be reproducible
8. Prioritize the projects with the biggest business impact