How do organizations scale data services and data science teams effectively? What are the building blocks for that process, and how can formal project management methodologies like Agile help run data projects more efficiently?
2. Cameron Sim
CoFounder at CrewSpark
in/cameronsim
@cameronsim
Data Science Engineering
Big Data Architecture
Cloud Platforms
Consulting
CrewSpark
Collaborative Data Science Platform
www.crewspark.com
ML Modeling & Collaboration
Data Governance
CI/CD Model Framework
Realtime Notebooks
3. #1 – The problems with Data Science
#2 – How do we move forward
#3 – Beyond Projects
6. Organizational Challenges
• Hard to find skills
• Lack of data governance
• Siloed Intelligence
• Lack of maturity/understanding
• Organically managed
• Limited transparency
• Very hard to quantify effectiveness
7. Project Level Challenges
• Lack of mature methodologies
• Lack of knowledge/adoption
• Inconsistent task tracking
• No standard approaches to QA
• Multiple data science teams with different approaches
• No analytics outside of issue tracking
11. - Embrace Innovation
- Experiment Always
- Streamline, Automate, Repeat
- Challenge the Status Quo
12. Data Driven Companies
• Culture of Experimentation
• Mature Data Governance & Access
• Common Toolsets
• Common Frameworks & Methodologies
13. Pyramid of Innovation
[Figure: pyramid with stages labelled Foundation, Analytical, Scientific. Elements, in the order they appear on the slide:]
• Sporadic Databases, Data in different formats, Ad hoc Reporting
• Federated Data Services, Self-Serve Reporting, Data Sourcing & Auto Processing, Information Architecture
• Master Data Management, Common Data Tooling, Data Culture
• Machine Learning Frameworks, Industrial Experimentation, Predictive Services
• Culture of Innovation
16. Data Science Projects
• Projects incorporate data-centric tasks
• Data is messy, unreliable
– that increases project risk
17. Agile Project Methodologies
• Designed to identify problems early
• Well established within most organizations
• Malleable, easily integrated
18. Agile (Scrum, Lean, Kanban, XP, etc.)
+
Approaches to (working with) Data
(CRISP-DM, KDD, SEMMA)
19. Approaches to working with Data
CRISP-DM – Cross-Industry Standard Process for Data Mining
SEMMA (SAS) – Sample, Explore, Modify, Model, Assess
KDD – Knowledge Discovery in Databases
25. Mapping Tasks to actual deliverables
[Figure: tasks mapped to deliverables]
Tasks: Clean Data, Create Model, Train Model
Deliverables: Notebook, Class/File, Function
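One way to make the task-to-deliverable mapping concrete is to have every tracked task point at an artifact. This is a minimal sketch only: the `Task` class, the `untracked` helper, and the example paths are hypothetical, and the slide does not say which task maps to which deliverable type, so the pairings in the comment below are assumptions.

```python
from dataclasses import dataclass
from typing import Optional

# Deliverable types named on the slide.
DELIVERABLE_TYPES = {"notebook", "class/file", "function"}


@dataclass
class Task:
    """A tracked task and the artifact it is expected to produce (hypothetical)."""
    title: str
    deliverable_type: Optional[str] = None
    deliverable_path: Optional[str] = None


def untracked(tasks):
    """Return titles of tasks that do not yet point at a valid deliverable."""
    return [
        t.title
        for t in tasks
        if t.deliverable_type not in DELIVERABLE_TYPES or not t.deliverable_path
    ]


# Illustrative pairings only; paths are made up.
tasks = [
    Task("Clean Data", "notebook", "notebooks/clean_data.ipynb"),
    Task("Create Model", "class/file", "models/churn_model.py"),
    Task("Train Model"),  # no deliverable recorded yet
]
```

Calling `untracked(tasks)` lists the tasks that still need a deliverable attached before they can be closed.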
26. Value Added
• Projects are data-centric
• Methodology that addresses problems with data
• Creating re-usable assets/artefacts
• Organization has a consistent approach to
executing data science objectives.
27. A consistent approach leads to
a better understanding
of what is possible
…which leads to
increased productivity
across the organization
30. Data Standards
• Code quality, commenting & performance
• Centralized Function repository
• Documentation & data dictionaries for sourced & new
datasets
• Model testing standards (confusion matrix, F-score, etc.)
• Model CI/CD framework
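As an illustration of a model testing standard, the sketch below computes binary confusion-matrix counts and an F-score in plain Python. The function names are hypothetical and the choice of a binary task is an assumption; a team would more likely standardize on whatever library it already uses.

```python
from collections import Counter


def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary classification task."""
    counts = Counter()
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            counts["tp" if actual == positive else "fp"] += 1
        else:
            counts["fn" if actual == positive else "tn"] += 1
    return counts["tp"], counts["fp"], counts["fn"], counts["tn"]


def f_score(tp, fp, fn, beta=1.0):
    """F-beta score; returns 0.0 when precision and recall are both zero."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```

A CI/CD step could call these on a held-out set and fail the build when the score drops below an agreed threshold.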
31. Master Data Management
• Centralized assets store
• Version Control
• Asset level access management
• Security standards for data at rest, data in transit
• Federated data system
32. A Framework to Innovate
• Homegrown tools to test new theories
• A/B Testing in a sanctioned environment
• New product or service development
• Backed up by business justification / hard numbers
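To show what "hard numbers" for an A/B test might look like, here is a minimal sketch of a two-proportion z-test using only the standard library. The test choice and the function name are assumptions, not something the slides prescribe.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) comparing conversion rates of B against A."""
    if n_a <= 0 or n_b <= 0:
        raise ValueError("both groups need at least one observation")
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both variants convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("no variation in outcomes; the test is undefined")
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value
```

The returned p-value can sit alongside the business justification when deciding whether an experiment graduates from the sanctioned environment.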
34. Performance
• Resource allocation & reporting (by the hour)
• Who is the best data scientist in the company?
• How many more data scientists do we need to do X?
• What kinds of data scientists do we have – could we
organize the teams to better enable the organization?
35. Central Intelligence
• How many regression models do we have in Python?
• Where are we using Neural Networks?
• Quickly bring up the model for X and self-audit.
• How accurate is model X and how has it changed over
time (Data Lineage)?
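The questions above assume some central catalogue of models. A minimal sketch of such a registry follows; `ModelRecord`, its fields, and the helper functions are all hypothetical, and the entries are made-up illustrations rather than real results.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class ModelRecord:
    """One catalogued model (hypothetical schema)."""
    name: str
    algorithm: str  # e.g. "regression", "neural_network"
    language: str   # e.g. "python", "r"
    accuracy_log: list = field(default_factory=list)  # (date, accuracy) pairs


def count_models(registry, algorithm, language):
    """How many models of a given algorithm exist in a given language?"""
    return sum(
        1 for m in registry if m.algorithm == algorithm and m.language == language
    )


def where_used(registry, algorithm):
    """Names of the models that use a given algorithm."""
    return sorted(m.name for m in registry if m.algorithm == algorithm)


def accuracy_change(record):
    """Latest minus earliest logged accuracy, or None with fewer than two entries."""
    if len(record.accuracy_log) < 2:
        return None
    ordered = sorted(record.accuracy_log, key=lambda entry: entry[0])
    return ordered[-1][1] - ordered[0][1]


# Made-up entries for illustration only.
registry = [
    ModelRecord("churn", "regression", "python",
                [(date(2023, 1, 1), 0.80), (date(2023, 6, 1), 0.85)]),
    ModelRecord("pricing", "regression", "r"),
    ModelRecord("vision", "neural_network", "python"),
]
```

With this in place, `count_models(registry, "regression", "python")` answers the first question and `accuracy_change` gives a crude view of how a model has drifted over time.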