@ODSC
RUNNING
DATA SCIENCE
PROJECTS
& INTEGRATION WITHIN THE
ORGANIZATIONAL ECOSYSTEM
Boston | May 1 - 4 2018
Cameron Sim
CoFounder at CrewSpark
in/cameronsim
@cameronsim
Data Science Engineering
Big Data Architecture
Cloud Platforms
Consulting
CrewSpark
Collaborative Data Science Platform
www.crewspark.com
ML Modeling & Collaboration
Data Governance
CI/CD Model Framework
Realtime Notebooks
#1 – The problems with Data Science
#2 – How do we move forward
#3 – Beyond Projects
#1 – The problems with Data Science
The right answer to the wrong problem…
Organizational Challenges
• Hard to find skills
• Lack of data governance
• Siloed Intelligence
• Lack of maturity/understanding
• Organically managed
• Limited transparency
• Very hard to quantify effectiveness
Project Level Challenges
• Lack of mature methodologies
• Lack of knowledge/adoption
• Inconsistent task tracking
• No standard approaches to QA
• Multiple data science teams with different approaches
• No analytics outside of issue tracking
Tell me something I don’t know
“We run ads”
- Embrace Innovation
- Experiment Always
- Streamline, Automate, Repeat
- Challenge the Status Quo
Data Driven Companies
• Culture of Experimentation
• Mature Data Governance & Access
• Common Toolsets
• Common Frameworks & Methodologies
Pyramid of Innovation
[Diagram: pyramid with tiers labeled Foundation, Analytical, Scientific, and Culture of Innovation]
• Sporadic Databases, Data in different formats, Ad hoc Reporting
• Federated Data Services, Self-Serve Reporting, Data Sourcing & Auto Processing
• Information Architecture, Master Data Management
• Common Data Tooling, Data Culture
• Machine Learning Frameworks, Industrial Experimentation, Predictive Services
#2 – How do we move forward
Projects drive culture,
…culture drives the organization.
Data Science Projects
• Projects incorporate data centric tasks
• Data is messy, unreliable
– that increases project risk
Agile Project Methodologies
• Designed to identify problems early
• Well established within most organizations
• Malleable, easily integrated
Agile (Scrum, Lean, Kanban, XP, etc.)
+
Approaches to (working with) Data
(CRISP-DM, KDD, SEMMA)
Approaches to working with Data
CRISP-DM – Cross Industry Standard Process for Data Mining
SEMMA (SAS) – Sample, Explore, Modify, Model, Assess
KDD – Knowledge Discovery in Databases
CRISP-DM
Feature Matrix
CRISP-DM                 SEMMA     KDD
Business Understanding   -         -
Data Understanding       Sample    Selection
(Data Understanding)     Explore   Pre-Processing
Data Preparation         Modify    Transformation
Modeling                 Model     Data Mining
Evaluation               Assess    Interpretation/Evaluation
Deployment               -         -
https://pdfs.semanticscholar.org/7dfe/3bc6035da527deaa72007a27cef94047a7f9.pdf
Agile + CRISP-DM?
Putting them together (using CRISP-DM)
[Diagram: CRISP-DM phases expressed as Epics and Stories]
• Business Understanding
• Data Understanding
• Data Preparation
• Modeling
• Evaluation
• Deployment
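One way to picture the pairing above is a small backlog structure in which each CRISP-DM phase becomes an epic and the work inside it becomes stories. This is a minimal sketch under that assumption, not the speaker's tooling: the `Epic` and `Story` classes and the example story titles are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List

# The six CRISP-DM phases, in the order listed on the slide.
CRISP_DM_PHASES = [
    "Business Understanding",
    "Data Understanding",
    "Data Preparation",
    "Modeling",
    "Evaluation",
    "Deployment",
]


@dataclass
class Story:
    """A unit of work inside an epic (hypothetical structure)."""
    title: str
    done: bool = False


@dataclass
class Epic:
    """One epic per CRISP-DM phase (an assumption made for illustration)."""
    phase: str
    stories: List[Story] = field(default_factory=list)

    def progress(self) -> float:
        """Fraction of stories completed; 0.0 when the epic has no stories."""
        if not self.stories:
            return 0.0
        return sum(s.done for s in self.stories) / len(self.stories)


def new_backlog() -> List[Epic]:
    """Create one empty epic for every CRISP-DM phase."""
    return [Epic(phase) for phase in CRISP_DM_PHASES]


backlog = new_backlog()
# Example stories are invented for illustration only.
backlog[2].stories.append(Story("Remove duplicate customer rows", done=True))
backlog[2].stories.append(Story("Impute missing order dates"))

for epic in backlog:
    print(f"{epic.phase}: {epic.progress():.0%} of {len(epic.stories)} stories done")
```

Keeping the phase name on the epic means ordinary issue-tracking reports can be grouped by CRISP-DM phase without extra tooling.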
Example Project Lifecycle
[Diagram: lifecycle stages with step markers 1 to 4]
• Business Understanding
• Data Understanding
• Data Preparation
• Modeling & Evaluation (iterations): Model, Train, Test
• Deployment
Mapping Tasks to actual deliverables
Tasks: Clean Data, Create Model, Train Model
Deliverables: Notebook, Class/File, Function
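To make the idea concrete, the three named tasks could each end up as a small, reusable code unit that a notebook then calls. The sketch below is illustrative only: the function and class names, the toy rows, and the trivial mean-based model are all assumptions, chosen so the example runs with the standard library alone.

```python
from statistics import mean
from typing import Dict, List, Optional

Row = Dict[str, Optional[float]]


def clean_data(rows: List[Row]) -> List[Dict[str, float]]:
    """Task 'Clean Data' as a function: drop rows containing a missing value."""
    return [
        {key: float(value) for key, value in row.items()}
        for row in rows
        if all(value is not None for value in row.values())
    ]


class MeanModel:
    """Task 'Create Model' as a class: a trivial baseline (hypothetical)."""

    def __init__(self) -> None:
        self.prediction: Optional[float] = None

    def predict(self) -> float:
        """Return the learned constant prediction."""
        if self.prediction is None:
            raise RuntimeError("model has not been trained")
        return self.prediction


def train_model(model: MeanModel, rows: List[Dict[str, float]], target: str) -> MeanModel:
    """Task 'Train Model' as a function: fit the baseline to the target column."""
    model.prediction = mean(row[target] for row in rows)
    return model


# In a notebook, the deliverables above would simply be imported and chained.
raw = [{"x": 1.0, "y": 10.0}, {"x": 2.0, "y": None}, {"x": 3.0, "y": 20.0}]
model = train_model(MeanModel(), clean_data(raw), target="y")
print(model.predict())
```

Because each task maps to a named unit, a ticket can point at the exact function or class it produced, which is what makes the artifact reusable by another team.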
Value Added
• Projects are data-centric
• Methodology that addresses problems with data
• Creating re-usable assets/artefacts
• Organization has a consistent approach to executing data science objectives
A consistent approach leads to
a better understanding
of what is possible
…which leads to
increased productivity
across the organization
#3 – Beyond Projects
Data Standards
Master Data
Management
Tooling
API / Self-Serve Access Framework to Innovate
Data Standards
• Code quality, commenting & performance
• Centralized Function repository
• Documentation & data dictionaries for sourced & new datasets
• Model testing standards (confusion matrix, F-score, etc.)
• Model CI/CD framework
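A model testing standard of the kind listed above can be as simple as a shared function that every team runs before a model is promoted. The following is a minimal sketch for binary labels using only the standard library; the `MIN_F1` threshold and the sample labels are assumptions, not values from the talk.

```python
from typing import Dict, Sequence


def confusion_matrix(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, int]:
    """Count true/false positives and negatives for binary labels (0 or 1)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            counts["tp"] += 1
        elif actual == 0 and predicted == 1:
            counts["fp"] += 1
        elif actual == 1 and predicted == 0:
            counts["fn"] += 1
        else:
            counts["tn"] += 1
    return counts


def f_score(cm: Dict[str, int], beta: float = 1.0) -> float:
    """F-beta score computed from confusion-matrix counts (F1 when beta is 1)."""
    b2 = beta * beta
    denominator = (1 + b2) * cm["tp"] + b2 * cm["fn"] + cm["fp"]
    return (1 + b2) * cm["tp"] / denominator if denominator else 0.0


MIN_F1 = 0.7  # hypothetical quality gate agreed by the organization

y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

cm = confusion_matrix(y_true, y_pred)
score = f_score(cm)
print(cm, score)
print("passes gate" if score >= MIN_F1 else "fails gate")
```

Running the same check in the model CI/CD pipeline gives every team one definition of "good enough" rather than a per-project one.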
Master Data Management
• Centralized assets store
• Version Control
• Asset level access management
• Security standards for data at rest, data in transit
• Federated data system
A Framework to Innovate
• Homegrown tools to test new theories
• A/B Testing in a sanctioned environment
• New product or service development
• Backed up by business justification / hard numbers
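Backing an A/B test with hard numbers usually means reporting a test statistic alongside the observed lift. As one possible illustration, the sketch below applies a pooled two-proportion z-test; the choice of test, the function name, and the conversion counts are assumptions, not something the talk specifies.

```python
import math
from typing import Tuple


def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int) -> Tuple[float, float]:
    """Return (z, two-sided p-value) for the difference in conversion rates B - A."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both variants convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if standard_error == 0:
        return 0.0, 1.0
    z = (p_b - p_a) / standard_error
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Invented counts: variant A converts 120 of 1000, variant B 150 of 1000.
z, p_value = two_proportion_z_test(120, 1000, 150, 1000)
print(f"z = {z:.3f}, p = {p_value:.4f}")
```

A normal approximation like this is only reasonable for fairly large samples; smaller experiments would call for an exact test instead.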
Performance Metrics & Central Intelligence
Performance
• Resource allocation & reporting (by the hour)
• Who is the best data scientist in the company?
• How many more data scientists do we need to do X?
• What kinds of data scientists do we have, and could we organize the teams to better enable the organization?
Central Intelligence
• How many regression models do we have in Python?
• Where are we using Neural Networks?
• Quickly bring up the model for X and self-audit.
• How accurate is model X, and how has it changed over time (Data Lineage)?
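Questions like these become straightforward once every model is recorded in one place with a little metadata. The sketch below assumes a hypothetical in-memory registry; the `ModelRecord` fields, the model names, teams, and accuracy figures are all invented for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ModelRecord:
    """Metadata kept for each model (hypothetical schema)."""
    name: str
    family: str    # e.g. "regression" or "neural_network"
    language: str  # implementation language
    team: str
    # (ISO date, accuracy) pairs, oldest first.
    accuracy_history: List[Tuple[str, float]] = field(default_factory=list)


registry = [
    ModelRecord("churn", "regression", "python", "growth",
                [("2018-01-01", 0.81), ("2018-04-01", 0.78)]),
    ModelRecord("pricing", "regression", "r", "finance",
                [("2018-02-01", 0.74)]),
    ModelRecord("image-tags", "neural_network", "python", "search",
                [("2018-03-01", 0.90), ("2018-04-15", 0.92)]),
    ModelRecord("ltv", "regression", "python", "finance",
                [("2018-03-10", 0.69)]),
]


def count_models(records: List[ModelRecord], family: str, language: str) -> int:
    """How many models of a given family do we have in a given language?"""
    return sum(r.family == family and r.language == language for r in records)


def teams_using(records: List[ModelRecord], family: str) -> List[str]:
    """Which teams are using a given model family?"""
    return sorted({r.team for r in records if r.family == family})


def accuracy_change(record: ModelRecord) -> float:
    """Latest accuracy minus first recorded accuracy (0.0 with under two entries)."""
    if len(record.accuracy_history) < 2:
        return 0.0
    return record.accuracy_history[-1][1] - record.accuracy_history[0][1]


print(count_models(registry, "regression", "python"))
print(teams_using(registry, "neural_network"))
print(round(accuracy_change(registry[0]), 2))
```

In practice the registry would sit behind a database or API, but the questions stay the same: they are filters and aggregations over model metadata.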
Thank You
