ODSC East 2018

@ODSC
RUNNING
DATA SCIENCE
PROJECTS
& INTEGRATION WITHIN THE
ORGANIZATIONAL ECOSYSTEM
Boston | May 1 - 4 2018

Cameron Sim
CoFounder at CrewSpark
in/cameronsim
@cameronsim
Data Science Engineering
Big Data Architecture
Cloud Platforms
Consulting
CrewSpark
Collaborative Data Science Platform
www.crewspark.com
ML Modeling & Collaboration
Data Governance
CI/CD Model Framework
Realtime Notebooks

#1 – The problems with Data Science
#2 – How do we move forward
#3 – Beyond Projects

#1 – The problems with Data Science

The right answer to the wrong problem…

Organizational Challenges
• Hard to find skills
• Lack of data governance
• Siloed Intelligence
• Lack of maturity/understanding
• Organically managed
• Limited transparency
• Very hard to quantify effectiveness

Project Level Challenges
• Lack of mature methodologies
• Lack of knowledge/adoption
• Inconsistent task tracking
• No standard approaches to QA
• Multiple data science teams with different approaches
• No analytics outside of issue tracking

Tell me something I don’t know

- Embrace Innovation
- Experiment Always
- Streamline, Automate, Repeat
- Challenge the Status Quo

Data Driven Companies
• Culture of Experimentation
• Mature Data Governance & Access
• Common Toolsets
• Common Frameworks & Methodologies

Pyramid of Innovation
• Tiers: Foundation, Analytical, Scientific, Culture of Innovation
• Elements: Sporadic Databases; Data in different formats; Ad hoc Reporting; Federated Data Services; Self-Serve Reporting; Data Sourcing & Auto Processing; Information Architecture; Master Data Management; Common Data Tooling; Data Culture; Machine Learning Frameworks; Industrial Experimentation; Predictive Services

Projects drive culture,
…culture drives the organization.

Data Science Projects
• Projects incorporate data-centric tasks
• Data is messy and unreliable, which increases project risk

Agile Project Methodologies
• Designed to identify problems early
• Well established within most organizations
• Malleable, easily integrated

Agile (Scrum, Lean, Kanban, XP, etc.)
+
Approaches to (working with) Data
(CRISP-DM, KDD, SEMMA)

Approaches to working with Data
CRISP-DM – Cross Industry Standard Process for Data Mining
SEMMA (SAS) – Sample, Explore, Modify, Model, Assess
KDD – Knowledge Discovery in Databases

Feature Matrix
CRISP-DM                 SEMMA     KDD
Business Understanding   -         -
Data Understanding       Sample    Selection
Data Understanding       Explore   Pre-Processing
Data Preparation         Modify    Transformation
Modeling                 Model     Data Mining
Evaluation               Assess    Interpretation/Evaluation
Deployment               -         -
https://pdfs.semanticscholar.org/7dfe/3bc6035da527deaa72007a27cef94047a7f9.pdf

Putting them together (using CRISP-DM)
Business Understanding
Data Understanding
Data Preparation
Modeling
Evaluation
Deployment
EPICS / Stories
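
One way to read this slide is that each CRISP-DM phase becomes an epic and the work inside it becomes stories. The sketch below is only an illustration of that reading: the `Epic` class, the `build_backlog` helper, and the assignment of story names to phases are hypothetical and are not taken from any particular tracking tool.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# The six CRISP-DM phases, in lifecycle order.
CRISP_DM_PHASES = [
    "Business Understanding",
    "Data Understanding",
    "Data Preparation",
    "Modeling",
    "Evaluation",
    "Deployment",
]


@dataclass
class Epic:
    """Hypothetical epic: one per CRISP-DM phase, holding its stories."""
    phase: str
    stories: List[str] = field(default_factory=list)


def build_backlog(stories_by_phase: Dict[str, List[str]]) -> List[Epic]:
    """Return one epic per phase, in lifecycle order.

    Phases with no stories still get an (empty) epic so the backlog
    always shows the full lifecycle.
    """
    unknown = set(stories_by_phase) - set(CRISP_DM_PHASES)
    if unknown:
        raise ValueError(f"Not a CRISP-DM phase: {sorted(unknown)}")
    return [
        Epic(phase, list(stories_by_phase.get(phase, [])))
        for phase in CRISP_DM_PHASES
    ]


# Assumed example: which phase each story belongs to is a guess.
backlog = build_backlog({
    "Data Preparation": ["Clean Data"],
    "Modeling": ["Create Model", "Train Model"],
})
for epic in backlog:
    print(epic.phase, "->", epic.stories)
```

Keeping an epic for every phase, even an empty one, makes gaps visible, for example a project with modeling stories but nothing planned for Deployment.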

Example Project Lifecycle
Business Understanding
Data Understanding
Data Preparation
Modeling & Evaluation (iterations): Model, Train, Test
Deployment
Figure markers: 1, 2, 3, 4

Mapping Tasks to actual deliverables
TASKS: Clean Data, Create Model, Train Model
DELIVERABLES: Notebook, Class/File, Function

Value Added
• Projects are data-centric
• Methodology that addresses problems with data
• Creating re-usable assets/artefacts
• Organization has a consistent approach to executing data science objectives

A consistent approach leads to
a better understanding
of what is possible
…which leads to
increased productivity
across the organization

Data Standards
Master Data Management
Tooling
API / Self-Serve Access
Framework to Innovate

Data Standards
• Code quality, commenting & performance
• Centralized Function repository
• Documentation & data dictionaries for sourced & new datasets
• Model testing standards (confusion matrix, F-score, etc.)
• Model CI/CD framework
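
As an illustration of what a model testing standard could look like, the sketch below computes binary confusion-matrix counts and an F-score in plain Python, then applies a pass/fail threshold. The function names and the 0.8 threshold are assumptions made for the example, not a standard proposed in the talk.

```python
from collections import Counter


def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary classification task."""
    counts = Counter()
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            counts["tp" if actual == positive else "fp"] += 1
        else:
            counts["fn" if actual == positive else "tn"] += 1
    return counts["tp"], counts["fp"], counts["fn"], counts["tn"]


def f_score(tp, fp, fn, beta=1.0):
    """F-beta score from confusion counts; beta=1 gives the usual F1."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def passes_standard(y_true, y_pred, min_f1=0.8):
    """Hypothetical quality gate: the 0.8 default is an assumed threshold."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    return f_score(tp, fp, fn) >= min_f1


y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1]
print(confusion_counts(y_true, y_pred))  # (tp, fp, fn, tn)
print(passes_standard(y_true, y_pred))
```

A gate like `passes_standard` is the kind of check a model CI/CD pipeline could run before promoting a model.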

Master Data Management
• Centralized assets store
• Version Control
• Asset level access management
• Security standards for data at rest, data in transit
• Federated data system

A Framework to Innovate
• Homegrown tools to test new theories
• A/B Testing in a sanctioned environment
• New product or service development
• Backed up by business justification / hard numbers
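
To make "hard numbers" concrete for an A/B test, here is a minimal sketch of a two-proportion z-test using only the standard library. The choice of test, the function name, and the sample figures are assumptions for illustration; the talk does not prescribe a statistical method.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rates.

    conv_a / conv_b are conversion counts, n_a / n_b are sample sizes.
    Returns (z, p_value).
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("No variation in outcomes; the test is undefined")
    z = (p_b - p_a) / se
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Made-up figures: variant A converts 100 of 1000, variant B 130 of 1000.
z, p = two_proportion_z_test(100, 1000, 130, 1000)
print(f"z = {z:.2f}, p = {p:.3f}")
```

The normal approximation used here is reasonable for large samples; small experiments would need a different test.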

Performance Metrics & Central Intelligence

Performance
• Resource allocation & reporting (by the hour)
• Who is the best data scientist in the company?
• How many more data scientists do we need to do X?
• What kinds of data scientists do we have, and could we organize the teams to better enable the organization?

Central Intelligence
• How many regression models do we have in Python?
• Where are we using Neural Networks?
• Quickly bring up the model for X and self-audit.
• How accurate is model X and how has it changed over time (Data Lineage)?
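
Questions like these assume some central catalogue of models exists. The sketch below shows one possible shape for it as an in-memory registry; `ModelRecord`, `ModelRegistry`, their fields, and the sample entries are all hypothetical and do not describe any specific product.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class ModelRecord:
    """Hypothetical catalogue entry for one model."""
    name: str
    algorithm: str   # e.g. "regression", "neural_network"
    language: str    # e.g. "python", "r"
    # (ISO date, accuracy) pairs recorded at each evaluation.
    accuracy_history: List[Tuple[str, float]] = field(default_factory=list)


class ModelRegistry:
    """In-memory registry; a real one would sit on a shared store."""

    def __init__(self) -> None:
        self._models: Dict[str, ModelRecord] = {}

    def register(self, record: ModelRecord) -> None:
        self._models[record.name] = record

    def find(self, algorithm: Optional[str] = None,
             language: Optional[str] = None) -> List[ModelRecord]:
        """Return models matching every filter that is provided."""
        return [
            m for m in self._models.values()
            if (algorithm is None or m.algorithm == algorithm)
            and (language is None or m.language == language)
        ]

    def accuracy_trend(self, name: str) -> List[Tuple[str, float]]:
        """Accuracy history for one model, oldest first."""
        return sorted(self._models[name].accuracy_history)


registry = ModelRegistry()
registry.register(ModelRecord("churn", "regression", "python",
                              [("2018-03-01", 0.81), ("2018-01-01", 0.78)]))
registry.register(ModelRecord("pricing", "regression", "r"))
registry.register(ModelRecord("vision", "neural_network", "python"))

# "How many regression models do we have in Python?"
print(len(registry.find(algorithm="regression", language="python")))
# "How accurate is model X and how has it changed over time?"
print(registry.accuracy_trend("churn"))
```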
