Join a data scientist on the journey to industrialization. From notebook to proof of concept, and from proof of concept to production, we cover what happened at Air France. These are not golden rules, but a true story. What exactly does industrializing data science mean? How do you package data science models? How should the roles of data scientists and data engineers fit together? Is continuous integration a wild dream for data scientists? This journey presents the key concepts that worked at Air France, and may shed new light on your own data science journey.
Pauline Ballereau - Air France & Nicolas Laille - Xebia
https://dataxday.fr/
video available: https://www.youtube.com/watch?v=ESx6wR6g4ukx
DataXDay - A data scientist journey to industrialization of machine learning
1. A DATA SCIENTIST JOURNEY TO INDUSTRIALIZATION OF MACHINE LEARNING MODELS
DataXDay 2018
17th May 2018
2. @DataXDay
DATA SCIENCE
FOUNDATIONS FOR DATA SCIENCE AT AIR FRANCE
Adoption of Operations Research for crew scheduling
Extension to other business domains: Revenue Management, Cargo, Ground services, …
Adoption of Hadoop
Focus on Machine Learning
Ops Research is now 120 engineers in Paris and Amsterdam
Adoption of data science within AFKL IT was favored by the existing Operations Research practice
3. @DataXDay
DATA SCIENCE
MACHINE LEARNING, SPONSORED BY ORGANIZATION
Organization, through Customer Data Management, is one of the key sponsors of industrialized data science within AFKL
Customer Data Management:
• Customer data strategy
• Customer knowledge
• Personalization
• Coordinates IT efforts
4. @DataXDay
DATA SCIENCE
STARTING POINT FOR DATA SCIENCE PROJECT IS A POC LOGIC
DWH: Historical Data, Business Intelligence
LOCAL: Data Sample, Proof of Concept
5. @DataXDay
DATA SCIENCE
WHAT IS AN « INDUSTRIALIZED » ENGINE?
Jupyter notebook, R -> Executable package
On my own -> Integrated within AFKL IT live ecosystem
Manual launch or crontab -> Automated calibration and prediction
I guess my code is flawless -> Unit tested
Theoretical performance -> Live feedback on performance
8. @DataXDay
DATA SCIENTISTS X DATA ENGINEERS
IT TAKES TWO TO BRING DATA PRODUCTS LIVE (AT LEAST)
Data Scientist (Stats, ML, AI): translates business ideas into data science
Data Engineer (Dev, Big data, project architecture)
PoC -> Start of industrialization -> Live Product V1
At the start of industrialization: "Help! How to ingest and expose data?"
9. @DataXDay
DATA SCIENTISTS X DATA ENGINEERS
KEEP THE FRONTIER LOOSE
Data scientist and data engineer are roles, not persons
Awareness of the data scientist role in live environments is key
10. @DataXDay
DATA SCIENTISTS X DATA ENGINEERS
A LIVE ECOSYSTEM
LIVE: Data feed
DWH: Historical Data, Business Intelligence
EXPLORATION: Historical Data, Proof of Concept
MODELS: Repository
Predictions
DATA API
12. @DataXDay
PACKAGING DATA SCIENCE
WHAT DO YOU EXPECT?
Setup: Feature engineering + Algorithm -> « Model »
Train: Model + Training data -> Trained model
Predict: Trained model + Prediction data -> Predictions
We expect two main functionalities: training and predicting
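The Setup / Train / Predict contract above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field name `delay_minutes`, the toy threshold "algorithm" and the function names are hypothetical and do not come from the talk.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence

Row = Dict[str, float]
FeatureFn = Callable[[Row], float]


@dataclass
class TrainedModel:
    """Output of the Train step: here, just one learned threshold."""
    threshold: float


def setup() -> FeatureFn:
    """Setup step: fix the feature engineering (hypothetical feature)."""
    return lambda row: row["delay_minutes"] / 60.0


def train(features: FeatureFn, rows: Sequence[Row], labels: Sequence[int]) -> TrainedModel:
    """Train step: model definition + training data -> trained model."""
    pos = [features(r) for r, y in zip(rows, labels) if y == 1]
    neg = [features(r) for r, y in zip(rows, labels) if y == 0]
    # Toy algorithm (an assumption): threshold halfway between the class means.
    return TrainedModel((sum(pos) / len(pos) + sum(neg) / len(neg)) / 2)


def predict(features: FeatureFn, model: TrainedModel, rows: Sequence[Row]) -> List[int]:
    """Predict step: trained model + prediction data -> predictions."""
    return [1 if features(r) > model.threshold else 0 for r in rows]
```

Keeping `train` and `predict` as the only two public operations means the same package can be called by a scheduler for calibration and by a live service for prediction.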
13. @DataXDay
PACKAGING DATA SCIENCE
STANDARDIZATION WITH THE PIPELINE PATTERN
Feature Engineering: SQLTransformer, VectorAssembler
Model training: LogisticRegression.fit(dataset)
Pipeline Model: LogisticRegressionModel.transform(dataset)
Dataset -> Dataset + Predictions
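To make the fit/transform contract behind the pipeline pattern concrete, here is a minimal pure-Python sketch. It only mimics the idea (estimators are fitted into transformers, and a fitted pipeline is itself a transformer); it is not the Spark ML API, and the stage names `AddRatio`, `MeanThreshold` and the `booked` / `capacity` fields are hypothetical.

```python
from typing import Dict, List

Row = Dict[str, float]


class Transformer:
    def transform(self, rows: List[Row]) -> List[Row]:
        raise NotImplementedError


class Estimator:
    def fit(self, rows: List[Row]) -> Transformer:
        raise NotImplementedError


class AddRatio(Transformer):
    """Feature engineering stage: adds a derived column."""
    def transform(self, rows: List[Row]) -> List[Row]:
        return [{**r, "ratio": r["booked"] / r["capacity"]} for r in rows]


class ThresholdModel(Transformer):
    """Fitted model: appends a prediction column to the dataset."""
    def __init__(self, threshold: float) -> None:
        self.threshold = threshold

    def transform(self, rows: List[Row]) -> List[Row]:
        return [{**r, "prediction": 1.0 if r["ratio"] > self.threshold else 0.0} for r in rows]


class MeanThreshold(Estimator):
    """Toy estimator: learns the mean ratio as decision threshold."""
    def fit(self, rows: List[Row]) -> Transformer:
        return ThresholdModel(sum(r["ratio"] for r in rows) / len(rows))


class PipelineModel(Transformer):
    """Fitted pipeline: applies every fitted stage in order."""
    def __init__(self, stages: List[Transformer]) -> None:
        self.stages = stages

    def transform(self, rows: List[Row]) -> List[Row]:
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class Pipeline(Estimator):
    """Unfitted pipeline: fits estimators on the data produced by earlier stages."""
    def __init__(self, stages: list) -> None:
        self.stages = stages

    def fit(self, rows: List[Row]) -> PipelineModel:
        fitted: List[Transformer] = []
        for stage in self.stages:
            if isinstance(stage, Estimator):
                stage = stage.fit(rows)
            rows = stage.transform(rows)
            fitted.append(stage)
        return PipelineModel(fitted)
```

The benefit of the pattern is that feature engineering and the model travel together: whatever is applied at training time is replayed identically at prediction time.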
14. @DataXDay
PACKAGING DATA SCIENCE
PEX, JUST LIKE UBERJAR
PEX = Project package + Company packages + External packages
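A single-file executable needs one entry point that exposes training and prediction. The sketch below shows what such an entry point might look like with `argparse`; the command names, options and the JSON "model" are assumptions made for illustration, not the Air France engine. When building with pex, an entry point of the form `module:function` (the `-e` / `--entry-point` option) could be pointed at a function like `main`.

```python
import argparse
import json
from typing import List, Optional


def train(data_path: str, model_path: str) -> None:
    """Calibrate a toy model (mean of the training values) and persist it."""
    with open(data_path) as f:
        values = json.load(f)
    with open(model_path, "w") as f:
        json.dump({"mean": sum(values) / len(values)}, f)


def predict(model_path: str, value: float) -> bool:
    """Load the persisted model and score one value."""
    with open(model_path) as f:
        model = json.load(f)
    return value > model["mean"]


def main(argv: Optional[List[str]] = None) -> int:
    """Hypothetical command-line entry point with two sub-commands."""
    parser = argparse.ArgumentParser(prog="engine")
    sub = parser.add_subparsers(dest="command", required=True)

    train_cmd = sub.add_parser("train")
    train_cmd.add_argument("--data", required=True)
    train_cmd.add_argument("--model", required=True)

    predict_cmd = sub.add_parser("predict")
    predict_cmd.add_argument("--model", required=True)
    predict_cmd.add_argument("--value", type=float, required=True)

    args = parser.parse_args(argv)
    if args.command == "train":
        train(args.data, args.model)
    else:
        print(predict(args.model, args.value))
    return 0
```

Because the project code, company packages and external packages are bundled together, the target machine only needs a compatible Python interpreter to run the same `train` and `predict` commands.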
15. @DataXDay
PACKAGING DATA SCIENCE
A LIVE ECOSYSTEM
LIVE: Data feed
DWH: Historical Data, Business Intelligence
EXPLORATION: Historical Data, Proof of Concept
MODELS: Repository
Predictions
DATA API
16. @DataXDay
PACKAGING DATA SCIENCE
A LIVE ECOSYSTEM… BUT TRAINING DATA AND LIVE DATA ARE DIFFERENT
LIVE: Data feed
DWH: Historical Data, Business Intelligence
EXPLORATION: Historical Data, Proof of Concept
MODELS: Repository
Predictions
DATA API
18. @DataXDay
FROM DWH TO DATALAKE
TRAINING DATA MUST BE THE SAME AS PRODUCTION
• The data warehouse holds the full historical data
• The production platform processes only what live apps need from the raw data
• Data processing is not identical on the two sides
• The production platform therefore has to build its own full historical data
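One way to keep training data and production data identical is to compute features with a single function that both the historical replay and the live path call. The sketch below illustrates that idea only; the event fields (`booking_day`, `departure_day`, `distance_km`) and the 3500 km cut-off are hypothetical.

```python
from typing import Dict, Iterable, List

Event = Dict[str, float]
Features = Dict[str, float]


def build_features(event: Event) -> Features:
    """Single source of truth for feature computation (hypothetical fields)."""
    return {
        "lead_days": float(event["departure_day"] - event["booking_day"]),
        "is_long_haul": 1.0 if event["distance_km"] > 3500 else 0.0,
    }


def historical_features(raw_events: Iterable[Event]) -> List[Features]:
    """Replay stored raw events to build the training history."""
    return [build_features(e) for e in raw_events]


def live_features(event: Event) -> Features:
    """Compute features for one incoming event on the live path."""
    return build_features(event)
```

With this layout, a change in feature logic is made once and reaches both the training history and live predictions, instead of being reimplemented separately in the warehouse and in production.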
19. @DataXDay
FROM DWH TO DATALAKE
FROM A HISTORICAL/LIVE SYSTEM
LIVE: Data feed
DWH: Historical Data, Business Intelligence
EXPLORATION: Historical Data, Proof of Concept
MODELS: Repository
Predictions
DATA API
20. @DataXDay
FROM DWH TO DATALAKE
TO A FULL LIVE SYSTEM
LIVE: Data feed, Historical Data
EXPLORATION: Historical Data, Proof of Concept
MODELS: Repository
Predictions
DATA API
22. @DataXDay
CONTINUOUS IMPROVEMENT
FROM BUD TO FLOWER
• Ease of deploying a new model
• Ease of extracting a new feature
• Ease of accessing new data
• Stay innovative
• Time To Market
24. @DataXDay
CONTINUOUS IMPROVEMENT
CONTINUOUS INTEGRATION
Goal: make sure each code modification does not break anything
What to do? Regularly fetch sources, build the project and run tests
Needs: tools to automate all tedious and repetitive tasks (because we are lazy)
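The "run tests" step of continuous integration could be as small as the following `unittest` sketch, which a CI job would execute on every change. The `load_factor` function is a hypothetical piece of feature code, used here only to have something to test.

```python
import unittest


def load_factor(booked: int, capacity: int) -> float:
    """Hypothetical feature: share of seats sold on a flight."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    return booked / capacity


class LoadFactorTest(unittest.TestCase):
    def test_nominal_value(self) -> None:
        self.assertAlmostEqual(load_factor(150, 200), 0.75)

    def test_rejects_non_positive_capacity(self) -> None:
        with self.assertRaises(ValueError):
            load_factor(10, 0)


def run_suite() -> bool:
    """Run the tests programmatically and report overall success."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(LoadFactorTest)
    result = unittest.TextTestRunner(verbosity=0).run(suite)
    return result.wasSuccessful()
```

A CI server would typically fail the build whenever `run_suite()` (or the equivalent test command) reports a failure, which gives the "not breaking anything" guarantee without manual effort.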
26. @DataXDay
CONTINUOUS IMPROVEMENT
TRACK MODEL VERSIONING
• Calibration metadata
  • Dataset used
  • Timestamp + code version
• Keep track of the link between models and predictions
  • Model used
  • Unique ID of prediction
  • Input dataset
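The tracking items listed above map naturally onto two small records, one per calibration and one per prediction. The following sketch is an assumption about how they could be modelled; the class names, field names and the hash-based `model_id` are hypothetical choices, not the scheme used at Air France.

```python
import hashlib
import json
import uuid
from dataclasses import asdict, dataclass, field


@dataclass(frozen=True)
class CalibrationMetadata:
    """What was used to calibrate a model: dataset, timestamp, code version."""
    dataset_id: str
    code_version: str
    trained_at: str  # ISO 8601 timestamp

    @property
    def model_id(self) -> str:
        # Deterministic identifier derived from the calibration metadata.
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()[:12]


@dataclass(frozen=True)
class PredictionRecord:
    """Links one prediction back to the model and input dataset behind it."""
    model_id: str
    input_dataset: str
    value: float
    prediction_id: str = field(default_factory=lambda: str(uuid.uuid4()))
```

Storing a `PredictionRecord` next to each prediction makes it possible to answer, after the fact, which calibration produced a given live result.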
29. @DataXDay
NEXT STEP
TOO MANY JOURNEYS
• How to maintain the momentum after a few teams have started the adventure?
• Every team experienced a different journey
• But every team found a different path