Abstract of the Presentation:
When Dat Tran started his data science career in 2013, everyone was into big data. In fact, big data was at the peak of inflated expectations (according to Gartner). You had to use tools like Hadoop and Spark to be one of the cool kids. Many data prophets told you that data is the new oil, or even gold. In 2018, things haven't changed. Data is still cool and going strong. It's eating the world, and yes, you still need big data, and now also deep, deep, very deep learning. There's a lot of bullshit bingo out there.
In this talk, Dat Tran wants to demystify the buzz in machine learning by presenting simple guidelines for successful data projects and real, practical use cases. He will also share use cases from idealo, Germany's largest price comparison service. And yes, it involves deep learning, and yes, it can get quite technical at times.
About the Author:
Dat Tran is currently co-heading the data team at idealo.de, where he leads a team of Data Scientists and Data Engineers. His aim is to turn idealo into a machine learning powerhouse. His research interests are diverse, from traditional machine learning to deep learning. Previously, he worked for Pivotal Labs and Accenture. He is a regular speaker and has presented at PyData and Cloud Foundry Summit. He also blogs about his work on Medium. His background is in Operations Research and Econometrics. Dat received his MSc in Economics from Humboldt University of Berlin.
DN18 | Demystifying the Buzz in Machine Learning! (This Time for Real) | Dat Tran | Idealo
1. Dat Tran - Head of Data Science
Dat Tran (Head of Data)
@datitran
Demystifying the Buzz in Machine Learning! (This Time for Real)
23/11/2018 - Data Natives 2018 Berlin
#idealoTech
3. What do we do at idealo? Some examples...
● Hotel image ranking for both aesthetic and technical quality
● Low-to-high resolution
● Recommendation engine
4. Check us out! #idealoTech
https://github.com/idealo
https://medium.com/idealo-tech-blog
15. Problem Statement
● For over 50% of the lead-outs, we don't know whether users bought anything or not
● We know it for Amazon & eBay, but with a two-day lag; other problems include direct vs. indirect sales
● Predicting sales is valuable, for example for CRM, the recommendation engine and many other use cases
16. Supervised Learning
Training samples (features → label):
price: 80, pis: 5, ... → sale
price: 5, pis: 1, ... → non-sale
price: 17, pis: 3, ... → sale

ML model training → Predictions:
price: 99, pis: 8, ... → non-sale
price: 65, pis: 2, ... → sale (82%)
price: 32, pis: 9, ... → sale (30%)
price: 40, pis: 5, ... → sale (50%)
price: 20, pis: 2, ... → sale (71%)
Deep Learning????
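The slide above sketches supervised learning on lead-out features. A minimal illustration of the same idea, using made-up numbers and scikit-learn's logistic regression (the talk does not say which model idealo actually uses):

```python
# Hypothetical sketch: features (price, pis) and labels mirror the slide's
# toy example; the data and model choice are illustrative only.
from sklearn.linear_model import LogisticRegression

# Training samples: [price, pis] -> sale (1) / non-sale (0)
X_train = [[80, 5], [5, 1], [17, 3], [60, 4], [10, 1], [25, 6]]
y_train = [1, 0, 1, 1, 0, 1]

model = LogisticRegression()
model.fit(X_train, y_train)

# Predict a sale probability for unseen lead-outs
X_new = [[99, 8], [65, 2], [32, 9]]
probs = model.predict_proba(X_new)[:, 1]  # probability of class "sale"
for features, p in zip(X_new, probs):
    print(f"price: {features[0]}, pis: {features[1]} -> sale ({p:.0%})")
```

The probabilities, not just the hard labels, are what make this useful downstream (e.g. for CRM targeting).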
21. Problem Statement
● 2,306,658 accommodations
● 308,519,299 images
● ~133 images per accommodation
Humans?
Deep Learning??
22. How to start a Deep Learning project
1. Computer Vision: build on pre-trained ImageNet models (e.g. AlexNet)
2. NLP: pre-trained language models (still immature)
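The point of this slide is to start computer-vision projects from models pre-trained on ImageNet rather than from scratch: freeze the backbone, train only a small head. A toy sketch of that pattern, with a fixed random projection standing in for the real pre-trained network (all data and names here are invented for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stand-in for a frozen, ImageNet-pre-trained backbone: a fixed mapping
# from raw inputs to feature vectors (in practice: MobileNet/VGG features).
W_frozen = rng.normal(size=(64, 16))

def extract_features(images):
    # images: (n, 64) flattened pixels -> (n, 16) "embeddings"
    return np.tanh(images @ W_frozen)

# Tiny synthetic "dataset"
images = rng.normal(size=(100, 64))
labels = (images[:, 0] > 0).astype(int)

# Only the small head is trained; the backbone stays frozen.
head = LogisticRegression().fit(extract_features(images), labels)
print("train accuracy:", head.score(extract_features(images), labels))
```

With a real backbone the head often needs only hundreds of labeled images, which is exactly the regime of the second iteration described later in the deck.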
23. Automate Image Quality Assessment
To automate image quality assessment, we trained two models:
● an aesthetic model → predicts the aesthetic score of an image
● a technical model → predicts the technical image quality (distortion, blur, etc.)
We followed the Google paper "NIMA: Neural Image Assessment", published 09/2017.
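NIMA does not predict a single number: it predicts a distribution over the scores 1-10, summarises it by its mean, and is trained with an Earth Mover's Distance loss. A small NumPy sketch of those two ingredients, using invented distributions (not real model output):

```python
import numpy as np

scores = np.arange(1, 11)  # NIMA rates images on a 1..10 scale

def mean_score(p):
    # NIMA's quality score: expected value of the predicted distribution
    return float(np.sum(scores * p))

def emd(p, q, r=2):
    # Earth Mover's Distance between two score distributions, as in the
    # NIMA paper (r=2 gives the squared-EMD training loss).
    cdf_diff = np.cumsum(p) - np.cumsum(q)
    return float((np.mean(np.abs(cdf_diff) ** r)) ** (1.0 / r))

# Illustrative distributions over scores 1..10
pred = np.array([0.0, 0.0, 0.05, 0.1, 0.2, 0.3, 0.2, 0.1, 0.05, 0.0])
true = np.array([0.0, 0.0, 0.0, 0.05, 0.15, 0.35, 0.25, 0.15, 0.05, 0.0])

print("predicted mean score:", round(mean_score(pred), 2))
print("EMD loss:", round(emd(pred, true), 4))
```

EMD penalizes predictions by how far probability mass has to move, so predicting 4 when the truth is 8 costs more than predicting 7, which is why it suits ordered score scales better than plain cross-entropy.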
24. Results - First Iteration
Aesthetic model - MobileNet
Linear correlation coefficient (LCC): 0.5987
Spearman's rank correlation coefficient (SRCC): 0.6072
Earth Mover's Distance (EMD): 0.2018
Accuracy (threshold at 5): 0.74
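These metrics can be computed with scipy; a sketch with invented predicted-vs-true score pairs (not idealo's data) showing how LCC, SRCC and the thresholded accuracy are obtained:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Hypothetical predicted vs. ground-truth mean aesthetic scores
y_true = np.array([5.2, 6.8, 4.1, 7.5, 5.9, 3.3, 6.1, 4.8])
y_pred = np.array([5.0, 6.2, 4.5, 7.1, 5.5, 3.9, 6.4, 4.6])

lcc, _ = pearsonr(y_true, y_pred)    # linear correlation coefficient
srcc, _ = spearmanr(y_true, y_pred)  # Spearman's rank correlation

# Accuracy with the slide's threshold at 5: binarize into good/bad images
acc = np.mean((y_true > 5) == (y_pred > 5))

print(f"LCC: {lcc:.4f}, SRCC: {srcc:.4f}, accuracy: {acc:.2f}")
```

SRCC only cares about ranking, which matters here because the product is a *ranking* of hotel images, not the absolute scores.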
26. Learnings
● First results were not good, but we only learned that because we released early
○ More domain-specific data is needed
● We could load-test our application, which is very valuable
○ Used MobileNet instead of VGG-16
27. Second Iteration
● We built a simple labeling application
● ~12 people from idealo Reise and Data Science labeled
○ 1,000 hotel images for aesthetics
○ 3,000 hotel images for technical quality
● We fine-tuned the aesthetic model with 800 training images
● Built an aesthetic test dataset with the remaining 200 images
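The 800/200 split of the 1,000 aesthetic labels can be done with a standard stratified train/test split; the file names and labels below are placeholders, not the real dataset:

```python
from sklearn.model_selection import train_test_split

# Hypothetical list of the 1,000 labeled hotel-image IDs and scores
image_ids = [f"hotel_img_{i}.jpg" for i in range(1000)]
labels = [i % 10 for i in range(1000)]  # placeholder aesthetic labels

# 800 images for fine-tuning, 200 held out as the test set; stratifying
# keeps the label distribution similar in both splits
train_ids, test_ids, train_y, test_y = train_test_split(
    image_ids, labels, test_size=200, random_state=42, stratify=labels
)
print(len(train_ids), len(test_ids))
```

Fixing `random_state` keeps the split reproducible, which the summary slide calls out as a requirement.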
35. This is our tech stack... only an extract ;)
PyData · Deep Learning · Big Data · Computer Vision · NLP · Production Machine Learning · Visualization · Data Preparation
42. Learnings
● Data changes constantly, so monitor your model performance on a regular basis
● A re-training pipeline is also important
● Don't do it manually; use appropriate tools for this, e.g. Apache Airflow
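A minimal sketch of the monitoring idea above: track model accuracy over a sliding window of recent predictions and flag when re-training is due. The window size and threshold are invented; in practice such a check would run as a scheduled job (e.g. an Apache Airflow task), not inline like this:

```python
from collections import deque

class ModelMonitor:
    """Flags re-training when recent accuracy drops below a threshold."""

    def __init__(self, window=100, threshold=0.7):
        self.window = deque(maxlen=window)  # rolling record of hits/misses
        self.threshold = threshold

    def record(self, prediction, actual):
        self.window.append(prediction == actual)

    def needs_retraining(self):
        if not self.window:
            return False
        accuracy = sum(self.window) / len(self.window)
        return accuracy < self.threshold

monitor = ModelMonitor(window=10, threshold=0.7)
# Feed in (prediction, actual) pairs as ground truth trickles in
for pred, actual in [(1, 1), (0, 1), (1, 1), (0, 0), (1, 0),
                     (0, 1), (1, 1), (0, 0), (1, 0), (0, 1)]:
    monitor.record(pred, actual)
print("re-train?", monitor.needs_retraining())
```

The same check could be wrapped in an Airflow sensor/operator so that a drop in accuracy automatically triggers the re-training DAG.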
44. Data Science Product Life Cycle
Problem Definition → Data Review → API Design → Feature Engineering → Modeling → Evaluation → Operationalization → Feedback → (back to Problem Definition)
45. Learnings
● Use git
● Dockerize (aka containerize) everything
● Use conda and/or pip for package management
● Automate pipeline management (testing, data)
● TDD & API-first strategy (everything as a microservice)
● Don't use Jupyter notebooks for production systems
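One hypothetical way the "Dockerize everything" and microservice bullets combine: a minimal image per model service. File names, the Python version and the port below are assumptions for illustration, not idealo's actual setup:

```dockerfile
# Minimal sketch of a containerized model microservice
FROM python:3.6-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
# The model is exposed behind an HTTP API (API-first, test-driven)
EXPOSE 8080
CMD ["python", "app.py"]
```

One service per model keeps deployments independent, which is what makes the load testing mentioned earlier practical.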
49. Summary
1. Think simple first and then, if it’s really needed, get more complex
2. Define your data product MVP and release as early as possible
3. Creating data products is a team sport
4. Use the right tool for the right problem
5. Use the cloud
6. Measure your model and improve it from time to time
7. Your results need to be reproducible
8. Prioritize the projects with the biggest business impact