Talk by Borys Biletskyy at Data Science Amsterdam and Data Science Utrecht. The talk covers the Machine Learning Engineer role and how it can improve the success rate of Data Science projects.
1. Role of Machine Learning Engineer
Borys Biletskyy
Data Science Amsterdam
28-05-2019
2. Agenda
1. About Myself
2. Motivation
3. Data Science Process
4. Roles in Data Analytics
5. 3 Challenges for ML Engineer
3. About Myself
● Software Engineer since 2004
○ Low level, C++ -> Enterprise, Java -> Data Driven, Scala
○ Dev, Tech Lead, Architect, Consultant
● Researcher since 2004
○ PhD in Theoretical Computer Science
○ Complexity and Scalability of ML Methods
● Machine Learning Engineer since 2017
○ Python, Scala
○ LeasePlan, Randstad, VodafoneZiggo
4. Motivation
● Low success rate of Data Analytics projects
○ Gartner: 60% of Data Analytics projects fail*
● General C-level recommendations
○ The Data Economy: Why do so many analytics projects fail?**
○ 8 Reasons why Data Analytics projects fail***
○ ...
● Often the problem is in a team structure
○ How the Machine Learning Engineer role can help
* - https://www.techrepublic.com/article/85-of-big-data-projects-fail-but-your-developers-can-help-yours-succeed/
** - https://www.dataversity.net/many-data-analytics-projects-fail-save/
*** - https://www.eastbanctech.com/technology-insights/what-the-tech/why-so-many-analytics-projects-fail.html
7. Data Scientist & Data Engineer
[Skill spectrum: Adv. Analytics, Math/Stats, ML/AI, Scripting, Programming, Distributed Sys., Data Pipelines]
Data Scientist:
● Driven by fast insights
● Small applications
● Highly dynamic development
● Interactive notebook scripts
● Running on a laptop
● Academic background
● Interacts with business/domain experts
Data Engineer:
● Agile
● Production systems
● QA and processes
● Modular, reusable, maintainable, scalable
● Running on a cluster
● Engineering background
● Interacts with platform engineers
8. Data Analytics Skills*
[Skill spectrum: Data Science covers Adv. Analytics, Math/Stats, ML/AI, Scripting; Data Engineering covers Programming, Distributed Sys., Data Pipelines]
● Typical ratio: 1 DS ~ 5 DE
* https://www.oreilly.com/ideas/data-engineers-vs-data-scientists
9. DataOps Teams*
[Same Data Science / Data Engineering skill spectrum]
● Ratio within a DataOps team: 1 DS ~ 3 DE
10. DataOps Team
● DataOps Team
○ cross-functional
○ owns the whole feature life cycle
○ dynamic
○ T-shaped
● Guilds & Feature Teams
● Data Platform AAS (as a service)
○ Platform Engineers
11. Machine Learning Engineer Role (Fill The Gap)
[Skill spectrum diagram: the ML Engineer fills the gap between Data Science and Data Engineering skills]
12. Machine Learning Engineer Role (Coordinating)
[Skill spectrum diagram: the ML Engineer coordinates across Data Science and Data Engineering]
13. ML Engineer
● Coordinates
● Improves communication
● Guards pragmatic development standards
● Sets (Agile) processes
● Makes the DE <-> DS handover smooth
● Balances the number of DEs and DSs
● Can work in both disciplines
● ML Engineer specific skills:
○ Custom ML algorithms
○ Custom ML solutions
○ ML model logistics
○ ML pipelines
[Skill spectrum diagram: ML Engineering spans both Data Science and Data Engineering; team mix of DS, ML and DE roles]
15. Challenge 1: Data Platform
Data Science process: Define Goal → Data Collection (DS) → Data Pre-Processing (DE) → Exploratory Data Analysis (DS, DE) → Feature Engineering (DS) → Modeling (DS) → Validation (DS, DE) → Deploy Model (DE) → Serve Model (Request|Batch|Stream) (DE) → Monitor (DS, DE)
Pain point: "Poor data quality"
16. Challenge 1: Data Platform
● Before:
○ Insights from data samples
○ Different teams: DS, DE, PE
○ Unsynchronized sprints
○ Loss of focus
○ Long time to market
○ Problem solving at different levels
■ Connectivity (PE)
■ Data Ingestion (DE)
■ EDA & Feature Engineering (DS)
● After:
○ Feature teams: DE, DS, ME (PE)
○ Continuous Data Platform improvements
○ Unified:
■ Data storage
■ Data ingestion
■ Data pre-processing
○ Early data ingestion from new sources
○ All data available for experimenting
○ Fewer rework and handover iterations
○ Faster time to market
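The "unified ingestion and pre-processing" idea can be sketched as a common source interface: every new data source implements it, so its data enters the shared pipeline (and becomes available for experimenting) early. A minimal pure-Python sketch; the class and function names are hypothetical, not from the talk:

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, Iterable, List

Record = Dict[str, Any]

class DataSource(ABC):
    """Common ingestion interface: each new source plugs into the
    same pipeline instead of getting its own one-off loader."""

    @abstractmethod
    def read(self) -> Iterable[Record]:
        ...

class CsvSource(DataSource):
    """Illustrative source: parses in-memory CSV text."""

    def __init__(self, text: str):
        self.text = text

    def read(self) -> Iterable[Record]:
        lines = self.text.strip().splitlines()
        header = lines[0].split(",")
        for line in lines[1:]:
            yield dict(zip(header, line.split(",")))

def ingest(source: DataSource) -> List[Record]:
    """Shared pre-processing, applied uniformly to any source."""
    return [{k: v.strip() for k, v in rec.items()} for rec in source.read()]

rows = ingest(CsvSource("id,city\n1, Amsterdam\n2, Utrecht"))
# rows: [{'id': '1', 'city': 'Amsterdam'}, {'id': '2', 'city': 'Utrecht'}]
```

The point of the interface is organizational as much as technical: a DS, DE or ML Engineer adding a source only writes a `read()`, and everything downstream stays unified.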
17. Challenge 2: Scalability of ML Methods (Tools)
[Same Data Science process diagram as Challenge 1]
Pain point: "This method is not scalable"
18. Challenge 2: Scalability of ML Methods (Tools)
● Before:
○ Horizontally scalable Data Platform AAS
○ Different teams
■ Different tools and standards
■ Unsynchronized sprints
○ No DE-DS coordination before deployment
■ Rework iterations
○ Lack of understanding of scalability
■ horizontal vs. vertical
○ Lack of understanding of ML stages
■ training vs. scoring
○ Unscalable tools: scikit-learn, R
○ Unscalable methods: Neural Nets
● After:
○ Feature teams: DE, DS, ME (PE)
○ Shared codebase
○ Standardised tooling
○ Reusable building blocks for ML pipelines:
■ Notebooks (easy to use)
■ Cluster (production ready)
○ Testing strategy
○ Automated deployment
○ DSs modify and deploy ML pipelines themselves
○ Faster time to market
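The "reusable building blocks" point can be made concrete: pipeline steps written as small composable functions run unchanged in a notebook (on a sample) and on a cluster (behind a distributed runner), so only the execution backend differs. A minimal pure-Python sketch; the step names and composition helper are illustrative, not from the talk, and in practice this would sit on top of something like Spark ML pipelines:

```python
from typing import Callable, List

Row = dict
Step = Callable[[List[Row]], List[Row]]

def pipeline(*steps: Step) -> Step:
    """Compose reusable building blocks into one pipeline stage."""
    def run(data: List[Row]) -> List[Row]:
        for step in steps:
            data = step(data)
        return data
    return run

# Illustrative building blocks:
def drop_nulls(rows: List[Row]) -> List[Row]:
    """Remove rows containing missing values."""
    return [r for r in rows if all(v is not None for v in r.values())]

def scale_amount(rows: List[Row]) -> List[Row]:
    """Rescale 'amount' to [0, 1] by dividing by the maximum."""
    hi = max(r["amount"] for r in rows)
    return [{**r, "amount": r["amount"] / hi} for r in rows]

prep = pipeline(drop_nulls, scale_amount)
out = prep([{"amount": 50}, {"amount": None}, {"amount": 100}])
# out: [{'amount': 0.5}, {'amount': 1.0}]
```

Because each block is a plain function with a uniform signature, the same code path is easy to test, which is what makes "DSs modify and deploy ML pipelines themselves" viable.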
19. Challenge 3: Model Serving
[Same Data Science process diagram as Challenge 1]
Pain point: "This model is too slow for real-time scoring"
20. Challenge 3: Model Serving
● Before:
○ Single team: DS, DE
○ Lack of DS-DE coordination
○ Poorly scalable design
■ In-memory (big) data processing
○ Poorly scalable methods
■ Cosine nearest-neighbour search
■ O(n) lookup instead of O(1)
○ Rework
○ Problems with real-time scoring
● After:
○ Single team: DE, DS, ME
○ Model serving is planned early
○ Efficient refinements
○ Serving strategy drives solution design
○ Less rework
○ Faster time to market
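The O(n)-vs-constant point above can be illustrated: brute-force cosine nearest-neighbour search scans every stored vector on each request, so scoring latency grows linearly with catalogue size; fine in a batch job, too slow for real-time serving at scale, which is why approximate indexes (e.g. LSH) that trade exactness for near-constant lookup are chosen when serving strategy drives the design. A minimal sketch with illustrative data:

```python
import math
from typing import List

def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def nearest(query: List[float], vectors: List[List[float]]) -> int:
    """Brute-force nearest neighbour: one full scan per request,
    i.e. O(n) scoring latency in the number of stored vectors."""
    best, best_sim = -1, -2.0
    for i, v in enumerate(vectors):
        sim = cosine(query, v)
        if sim > best_sim:
            best, best_sim = i, sim
    return best

idx = nearest([1.0, 0.0], [[0.0, 1.0], [1.0, 0.1], [-1.0, 0.0]])
# idx == 1: the second vector points almost the same way as the query
```

Spotting this kind of complexity mismatch between a notebook experiment and a real-time serving requirement is exactly the DS-DE coordination the ML Engineer provides.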