3. “Agile software development refers to a group of software development methodologies based on iterative development, where requirements and solutions evolve through collaboration between self-organizing cross-functional teams.”
6. Data Science Phases - Agility
Discovery Phase
● Rapid iteration and experimentation
● Data exploration & cleaning
● Feature engineering
● Model building
● Model evaluation
12. Data Science Phases - Agility
Operationalization (O16n) Phase
● Data pipelines: automated and manageable
● Testing: with CI
● Monitoring: to react to failures
● APIs to consume model output
(See the Madlib Flow talk by Frank and Jarrod)
14. Airflow
● Apache project spun out of Airbnb
● “Airflow is a platform to programmatically author, schedule and monitor workflows.”
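As an illustration of "author, schedule and monitor workflows" in code, here is a minimal sketch of an Airflow DAG file. This assumes an Airflow 2.x installation; the DAG id and task names are illustrative, not from this deck:

```python
# Minimal Airflow DAG: a workflow authored and scheduled in Python.
# Requires an Airflow 2.x installation to actually run.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def fetch():
    print("fetching raw trajectory data")


def clean():
    print("cleaning fetched data")


with DAG(
    dag_id="example_pipeline",          # illustrative name
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",         # Airflow handles the scheduling
    catchup=False,
) as dag:
    fetch_task = PythonOperator(task_id="fetch", python_callable=fetch)
    clean_task = PythonOperator(task_id="clean", python_callable=clean)

    fetch_task >> clean_task  # declare the dependency: fetch before clean
```

Airflow then runs the tasks in dependency order on the given schedule and exposes their status in its UI for monitoring.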
15. Data Science Use-Case
● The data
○ Time-series trajectories with latitude and longitude of location
○ A subset of the trajectories is labeled as walk / not walk
● Our model
○ Build a classification model using the labeled data to identify whether new unlabeled trajectories are walk or not walk
[Figure: example trajectories]
16. Example data
[Figures: example trajectory data and example label data]
We have mode labels of walk and not walk only for a subset of the incoming daily trajectories.
17. Discovery phase → Operationalization phase
After every model iteration we check whether the model is viable:
● Check the quantitative metrics of the model, such as accuracy, AUC, and the ROC curve
● Check the qualitative results of the model and whether they make sense to a subject matter expert
Once we are convinced that the model is both quantitatively and qualitatively viable, we can move to the Operationalization phase.
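As a minimal sketch of the quantitative checks, accuracy and AUC can be computed in plain Python (in practice these would come from the model evaluation step; the function names are illustrative):

```python
def accuracy(labels, preds):
    """Fraction of predictions that match the true labels."""
    return sum(1 for y, p in zip(labels, preds) if y == p) / len(labels)


def auc_score(labels, scores):
    """AUC as the probability that a random positive example
    is ranked above a random negative example (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

In a real pipeline a library routine (e.g. MADlib or scikit-learn) would compute these; the point is only what the numbers mean.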
18. Discovery phase → Operationalization phase
Example of code from the discovery phase that is converted into a task script.
19. Architecture overview
Discovery phase (iterative): data exploration → feature engineering → model building → model evaluation. Output: an ML model plus data pre-processing.
Operationalization phase: modular idempotent tasks connected to create automated workflows, namely a model refitting workflow and a new-data inference workflow. TBD: expose model results using an API.
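A "modular idempotent task" can be sketched as a function that skips its work when its output already exists, so re-running a workflow is always safe. This is a minimal file-based sketch with illustrative names (real pipelines might check a database table instead):

```python
import json
import os


def idempotent_task(input_path, output_path, transform):
    """Apply transform to the JSON input and write the JSON output,
    skipping the work entirely if the output already exists.
    Re-running the task (or the whole workflow) is then harmless."""
    if os.path.exists(output_path):
        return output_path  # nothing to do; the task already ran
    with open(input_path) as f:
        data = json.load(f)
    with open(output_path, "w") as f:
        json.dump(transform(data), f)
    return output_path
```

Because each task is idempotent, a failed workflow can simply be restarted from the top: completed tasks are no-ops and only the missing outputs get recomputed.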
20. Data Science Phases - Agility
Operationalization (O16n) Phase
● Data pipelines: automated and manageable
● Testing: with CI/CD
● Monitoring: to react to failures
● APIs to consume model output
(See the Madlib Flow talk by Frank and Jarrod)
21. End-to-End Directed Acyclic Graph (DAG)
Fetch → Clean → Transform → Feature Engineering, branching into:
● Extract labeled data for model creation/refitting
● Inference for unlabeled data
22. Data Prep and Feature Engineering
Fetch → Clean → Transform → Feature Engineering, branching into:
● Extract labeled data for model creation/refitting
● Inference for unlabeled data
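For the walk / not-walk use-case, one natural engineered feature is speed derived from consecutive lat/long points. A minimal sketch (the haversine formula is standard; the feature choice here is illustrative, not necessarily what the talk used):

```python
import math


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two lat/long points."""
    R = 6371000.0  # mean Earth radius in meters
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = (math.sin(dp / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2)
    return 2 * R * math.asin(math.sqrt(a))


def mean_speed_mps(points):
    """points: list of (timestamp_seconds, lat, lon), time-ordered.
    Returns the mean speed in m/s over consecutive point pairs."""
    speeds = []
    for (t0, la0, lo0), (t1, la1, lo1) in zip(points, points[1:]):
        speeds.append(haversine_m(la0, lo0, la1, lo1) / (t1 - t0))
    return sum(speeds) / len(speeds)
```

A trajectory whose mean speed sits around walking pace (roughly 1-2 m/s) is a strong signal for the walk class.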
26. Model Training DAG
● This DAG has a single task for model training
● In this task we split the data into train and test samples, train the model, evaluate the model, and capture the accuracy, AUC, and model tables
● We want all of the above to run together in that single task
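The single training task above can be sketched in plain Python. A majority-class baseline stands in for the real classifier, and a returned dict stands in for the captured model/metrics tables; all names are illustrative:

```python
import random


def train_and_evaluate(rows, seed=0):
    """rows: list of (features, label) with labels in {0, 1}.
    Split into train/test, fit a stand-in majority-class model,
    and capture its test accuracy, all in one task."""
    rng = random.Random(seed)
    rows = rows[:]
    rng.shuffle(rows)
    cut = int(0.8 * len(rows))            # 80/20 train/test split
    train, test = rows[:cut], rows[cut:]
    majority = max(
        (0, 1), key=lambda c: sum(1 for _, y in train if y == c)
    )
    acc = sum(1 for _, y in test if y == majority) / len(test)
    # Stand-in for writing to the model and metrics tables:
    return {"model": {"predict": majority}, "accuracy": acc}
```

Keeping split, fit, and evaluation in one task means the captured metrics always describe exactly the model that was stored.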
29. Model Scoring DAG
● The unlabeled data extracted from the features table is scored in this DAG
● We first check whether any model has been built
● If a model exists, we score the data (inference)
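The check-then-score branch can be sketched as below. A plain list stands in for the models metadata table, and the stand-in "model" from the training sketch above is assumed; both are illustrative:

```python
def scoring_branch(models_metadata, unlabeled_rows):
    """Mirror the scoring DAG's branch: if no model has been built yet,
    skip inference; otherwise score the unlabeled rows with the
    most recent model."""
    if not models_metadata:
        return "skip_inference", []
    model = models_metadata[-1]  # most recently trained model
    preds = [model["predict"] for _ in unlabeled_rows]  # stand-in scorer
    return "score", preds
```

The explicit "no model yet" branch lets the scoring DAG run safely on day one, before the first training run has completed.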
31. Model Re-Training
● Every day we get some more labeled data; once we have accumulated enough labeled data, we can retrain the model for better accuracy
● We have scheduled model re-training monthly
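The retraining condition, monthly cadence gated on enough newly accumulated labels, can be sketched as a single predicate. The 30-day window and row threshold are illustrative values, not from the deck:

```python
from datetime import date, timedelta


def should_retrain(last_trained, today, new_labeled_rows, min_rows=1000):
    """True when at least a month has passed since the last training run
    AND enough new labeled rows have accumulated to make it worthwhile."""
    return (today - last_trained >= timedelta(days=30)
            and new_labeled_rows >= min_rows)
```

In Airflow this would typically be the monthly schedule on the retraining DAG plus a short-circuit check on the labeled-row count.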
33. Data Science Phases - Agility
Operationalization (O16n) Phase
● Data pipelines: automated and manageable
● Testing: with CI/CD
● Monitoring: to react to failures
● APIs to consume model output
(See the Madlib Flow talk by Frank and Jarrod)
34. Testing with CI/CD
● Testing Data Pipelines is hard
● Test Coverage (Test Tasks vs Test DAGs)
● Testing as part of the CI/CD
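Testing individual task functions (rather than whole DAGs) is the easier half of the coverage question above: each task is just a function with inputs and outputs. A minimal pytest-style sketch with an illustrative task:

```python
def clean_trajectory(points):
    """Illustrative task under test: drop points that are missing
    a latitude or longitude. points: list of (timestamp, lat, lon)."""
    return [p for p in points if p[1] is not None and p[2] is not None]


def test_clean_trajectory_drops_missing_coordinates():
    pts = [(0, 10.0, 20.0), (1, None, 20.0), (2, 10.1, 20.1)]
    assert clean_trajectory(pts) == [(0, 10.0, 20.0), (2, 10.1, 20.1)]
```

CI (e.g. CircleCI) can run such task-level tests on every commit; DAG-level tests, which need a scheduler and data fixtures, are the harder part alluded to on this slide.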
36. Data Science Phases - Agility
Operationalization (O16n) Phase
● Data pipelines: automated and manageable
● Testing: with CI/CD
● Monitoring: to react to failures
● APIs to consume model output
(See the Madlib Flow talk by Frank and Jarrod)
37. Monitoring and Error Fixing
● Monitoring and error fixing are a big part of responsive data pipelines
● The ability to quickly identify what is failing and why, and to fix it with minimum lead time, is crucial
● In this demo we will showcase an error-fixing case
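Reacting to failures usually combines bounded retries with an alert when retries are exhausted, which is what Airflow's per-task retries and failure callbacks provide. A minimal plain-Python sketch of the pattern (names are illustrative):

```python
def run_with_retries(task, retries=2, on_failure=lambda msg: None):
    """Run a task, retrying on failure. If every attempt fails,
    invoke the alert callback so the failure is visible immediately,
    then re-raise the last exception for the scheduler to record."""
    last_exc = None
    for _ in range(retries + 1):
        try:
            return task()
        except Exception as exc:
            last_exc = exc
    on_failure(f"task failed after {retries + 1} attempts: {last_exc}")
    raise last_exc
```

The alert callback is where an email, Slack message, or pager notification would go, giving the quick "what is failing and why" signal this slide calls for.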
39. Data Science Phases - Agility
Operationalization (O16n) Phase
● Data pipelines: automated and manageable
● Testing: with CI/CD
● Monitoring: to react to failures
● APIs to consume model output
(See the Madlib Flow talk by Frank and Jarrod)
40. Conclusion
Greenplum and Jupyter notebooks provide a set of tools for doing Agile Data Science during the discovery phase.
Greenplum, along with Airflow and CircleCI, is very effective for doing Agile Data Science during the operationalization phase.
43. Demo Flow
1) 4/12 - 5/5 -> pre-demo
2) Insert row -> pre-demo
3) 5/5 - 5/10 -> data prep and feature engineering
4) Model training (show the geolife.models_metadata table before and after running)
5) Model scoring 5/10 - 5/12 (show geolife.walk_prediction_results and the count increase after each DAG run)
6) Model re-training (show geolife.models_metadata again)
7) Monitoring & error fixing 5/13 - 5/14 (uncomment line and rerun)
8) Testing - TBD