WHY SO MANY DATA SCIENCE PROJECTS FAIL?
Ethan Ram / Aug. 2018
DATA SCIENCE PROJECTS FAIL…
• Between 70% and 80% of corporate business intelligence projects fail (Gartner)
• 55% of big data projects are never finished (Infochimps)
• Only 13% of organizations achieve full-scale production for their in-house big-data implementations (Qubole)
• And the results…
9/3/2018 Why So Many Data Science Projects Fail 2
“I HATE THIS JOB!”
Top of the list of developers who said they are looking for a new job*:
• ML specialists - 14.3%
• Data scientists - 13.2%
* 2018 Stack Overflow survey based on 64,000 developers’ answers
DATA SCIENCE APPLICATION LIFECYCLE
Business objective and plan > Build dataset > Model data and validate > Implement application > Deploy > Monitor, measure & optimize
We’ll look at some common failures in each step and suggest better approaches.
BUSINESS OBJECTIVE FAILURES
• Expecting first-day success
• Expecting no false positives
• Expecting 100% accuracy
• No business value expected
• Expecting that the ML itself would be the product
• Not defining the deliverable
CAN YOU AFFORD A FALSE POSITIVE?
• Google “fixed” its “racist” algorithm by removing gorillas from its image-labeling tech
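To make the cost of a false positive concrete, here is a minimal sketch of how accuracy, precision and recall can diverge on the same predictions. The labels and counts are made up for illustration and do not come from the talk.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def metrics(y_true, y_pred):
    """Return accuracy, precision, recall and false-positive rate."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        "false_positive_rate": fp / (fp + tn) if (fp + tn) else 0.0,
    }


# Illustrative data: 100 cases, 5 truly positive.
# The hypothetical model flags 10 cases and catches 4 of the 5 positives.
y_true = [1] * 5 + [0] * 95
y_pred = [1] * 4 + [0] * 1 + [1] * 6 + [0] * 89
m = metrics(y_true, y_pred)
print(m)
```

In this made-up case accuracy looks high while most of the flagged cases are false positives, which is why the acceptable false-positive rate is worth agreeing on as part of the business objective.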
BE REALISTIC!
• Very few businesses have a core product that is AI/ML/data based
• Most use those tools to improve their bottom lines with existing products
TYPES OF DELIVERABLES
1. Descriptive analysis (offline report)
2. Dashboard (real-time system)
3. Automated decision-making system (“self-driving” system)
4. Dataset with specific qualities (to be used by another ML system)
   Define: leverage, friction to impact and cleanness
5. Methodology (dataset >> model)
6. Framework (API/SDK to build methodologies)
7. Proof-of-concept (prove that a methodology is viable)
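The notes for this deck describe cleanness as the percentage of errors in the dataset. A rough sketch of measuring it on a list of user-entered addresses follows; the validity rule `looks_like_address` and the sample records are hypothetical, and a real project would need a domain-specific check.

```python
def error_rate(records, is_valid):
    """Share of records failing the validity check, as a percentage (0-100)."""
    if not records:
        return 0.0
    bad = sum(1 for r in records if not is_valid(r))
    return 100.0 * bad / len(records)


def looks_like_address(text):
    """Hypothetical rule: non-empty and contains both a digit and a letter."""
    text = text.strip()
    return (
        bool(text)
        and any(c.isdigit() for c in text)
        and any(c.isalpha() for c in text)
    )


# Made-up sample of addresses as users might type them into a form.
addresses = [
    "12 Main St", "", "n/a", "7 Oak Avenue",
    "????", "221B Baker Street", "unknown", "5 Elm Rd",
]
print(f"errors: {error_rate(addresses, looks_like_address):.1f}%")
```

Tracking a number like this before and after cleaning gives a simple way to state the "cleanness" quality of a dataset deliverable.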
PLANNING FAILURES
• Missing diversity in the team
• In many projects 80% of the work is working on the dataset!
• It’s a *research* project!
• Short time to delivery
Figure: Drew Conway, Data Science Venn Diagram (with an “Engineering” label)
YOLO V3 NETWORK ARCHITECTURE
DATA INVENTORY FAILURES
• Too little data to build on
• Dataset is dirty
• Missing data from the field
DIRTY DATASET: NEGATIVE INFLUENCE
Figure: a dataset that includes negative influence examples, and the resulting classification (with confidence)
DATA MODELING FAILURES
You need to be able to understand the result!
• Jumping to conclusions on what the data is
• Assuming it works based on a small sample
• Feedback loop in results
• Missing cross-validation
• Choosing algorithms that are too heavy for the application
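As an illustration of the "missing cross-validation" point, here is a minimal k-fold sketch in plain Python. The "model" is a deliberately trivial stand-in (it predicts the training mean) and the data is synthetic; both are assumptions made only to keep the example self-contained.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Split range(n) into k shuffled, near-equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def fit_mean(ys):
    """Toy 'model': always predict the mean of the training targets."""
    return sum(ys) / len(ys)


def cross_validate(ys, k=5, seed=0):
    """Return the mean absolute error of the toy model on each held-out fold."""
    folds = k_fold_indices(len(ys), k, seed)
    errors = []
    for i, test_idx in enumerate(folds):
        # Train on every fold except the one held out for testing.
        train = [ys[j] for f, fold in enumerate(folds) if f != i for j in fold]
        model = fit_mean(train)
        mae = sum(abs(ys[j] - model) for j in test_idx) / len(test_idx)
        errors.append(mae)
    return errors


ys = [float(v) for v in range(20)]  # synthetic targets
fold_errors = cross_validate(ys, k=5)
cv_score = sum(fold_errors) / len(fold_errors)
print(fold_errors, cv_score)
```

Looking at the spread of the per-fold errors, not only their average, is one way to notice that a result obtained on a single small sample does not hold up.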
ML ALGORITHMS [PARTIAL] MAP
(Application > Class > Algorithms)
• Supervised learning
   • Classification: Linear classifiers / Fisher's discriminant; Support vector machines / Least squares; Quadratic classifiers; Kernel estimation; K-nearest neighbor; Boosting; Decision trees; Random forests; Neural networks; Learning vector quantization
   • Regression: Linear Regression; Logistic Regression; CART; Naïve Bayes
   • Ensemble: Bagging with Random Forests; Boosting with XGBoost
• Unsupervised learning
   • Association: Apriori
   • Clustering: K-means; Mean-Shift; Density-Based Spatial; EM-GMM; Agglomerative Hierarchical
   • Dimensionality Reduction
      • Feature Selection: Variance Thresholds; Correlation Thresholds; Genetic Algorithms (GA); Stepwise Search
      • Feature extraction: PCA; Linear Discriminant Analysis (LDA); Autoencoders
• Reinforcement learning
   • Exploration; Criterion of optimality; Brute force; Value function; Direct policy search
APPLICATION FAILURES
• Requesting the data scientists team to build the application…
• Not testing to scale
• Switching from monitoring to automatic action-taking too fast
• Missing safeguards on output
• Not preparing for attack!
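One way to read "missing safeguards on output" is that a model's prediction should pass range and confidence checks before any automatic action is taken. The sketch below assumes a hypothetical pricing model that outputs a discount; the thresholds and the `Decision` type are illustrative, not part of the talk.

```python
from dataclasses import dataclass


@dataclass
class Decision:
    action: str   # "auto" = act automatically, "review" = send to a human
    value: float  # the value that would actually be used
    reason: str


def safeguard(predicted_discount, confidence, lo=0.0, hi=0.30, min_confidence=0.8):
    """Wrap a model output with sanity checks (all thresholds are hypothetical)."""
    if predicted_discount != predicted_discount:  # NaN is never equal to itself
        return Decision("review", lo, "model returned NaN")
    if confidence < min_confidence:
        return Decision("review", lo, "low confidence")
    if not lo <= predicted_discount <= hi:
        clamped = min(max(predicted_discount, lo), hi)
        return Decision("review", clamped, "output outside allowed range")
    return Decision("auto", predicted_discount, "ok")


print(safeguard(0.10, 0.95))  # within range, confident
print(safeguard(0.90, 0.95))  # out of range: clamped and routed to review
print(safeguard(0.10, 0.50))  # low confidence: routed to review
```

Routing doubtful outputs to a human first, and widening the "auto" path only as the monitoring data justifies it, is one way to avoid switching to automatic action-taking too fast.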
DIRECT ATTACK EXAMPLE

SYNTHESIZED ADVERSARIAL EXAMPLE
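The idea of a small, carefully crafted perturbation can be sketched on a toy model. The example below applies a single signed-gradient step (in the style of the fast gradient sign method, which is simpler than the projected gradient descent mentioned in the notes) to a hand-written logistic classifier. The weights, input and epsilon are all made up for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def predict(weights, bias, x):
    """Probability of class 1 under a logistic model."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def sign(v):
    return (v > 0) - (v < 0)


def signed_gradient_step(weights, bias, x, y_true, epsilon):
    """Move each feature by +/- epsilon in the direction that increases the log-loss."""
    p = predict(weights, bias, x)
    # For logistic regression, d(log-loss)/dx_i = (p - y_true) * w_i.
    grad = [(p - y_true) * w for w in weights]
    return [xi + epsilon * sign(g) for xi, g in zip(x, grad)]


# Hypothetical model and input, chosen only for illustration.
weights, bias = [2.0, -1.5, 0.5], 0.1
x = [0.3, 0.1, 0.2]
x_adv = signed_gradient_step(weights, bias, x, y_true=1, epsilon=0.2)

print(predict(weights, bias, x))      # above 0.5: classified as class 1
print(predict(weights, bias, x_adv))  # below 0.5: classification flipped
```

Each feature moves by at most epsilon, yet the predicted class changes; attacks on deep networks follow the same principle with gradients obtained by backpropagation.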
“WE HAVEN’T SEEN ANYTHING LIKE THIS BEFORE…”
MONITOR > MEASURE > OPTIMIZE FAILURES
• Assuming it just works…
   • Not having a long enough beta
   • Missing feedback from real users
• Missing KPIs
   • Measure business success
   • Find false positives
• Missing built-in A/B testing
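For the A/B testing point, a minimal sketch of comparing a control group against a variant with a two-proportion z-test is shown below. The conversion counts are invented, and the choice of this particular test is an assumption; the talk does not prescribe a method.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference between two conversion rates."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0, 1.0  # no variation at all: nothing to distinguish
    z = (p_b - p_a) / se
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Made-up numbers: control converts 200 of 5000, the variant 250 of 5000.
z, p_value = two_proportion_z_test(200, 5000, 250, 5000)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

Building this kind of comparison into the application from the start makes it possible to tie a model change to a KPI instead of assuming it just works.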
“Right now, a lot of our AI systems make decisions in ways that people don't really understand… And I don't think that… we want to end up with systems that people don't understand how they're making decisions.”
Zuckerberg at Senate hearing, 10-Apr-18
DATA SCIENCE APPLICATION LIFECYCLE
Business objective and plan > Build dataset > Model data and validate > Implement application > Deploy > Monitor, measure & optimize
• Q&A


Editor's Notes

  1. Expect magic to happen! YOLO (You Only Look Once) is a lightweight real-time object detection system that can detect objects in a video stream. It took 5 years to get to this version. Dan Ariely: “Big data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it...”
  2. Precision vs. accuracy. False positive vs. false negative… Think of automatic cancer prediction: it can reduce the false negatives of a human professional, such as a radiologist.
  3. Outputs of data science:
     • Descriptive analysis (a report): a clear answer to a clear question, such as which platform the new product should be released on, Android or iOS. This is usually an offline system.
     • Dashboard: helps a human decide and take action continuously, or again and again. This is usually an online/real-time system.
     • Automated decision making: based on the dashboard, take automatic action (a “self-driving” system).
     • Dataset: data that is then used by another algorithm. For example, a cleaned-up list of addresses that users entered on a form, or a dataset used for training and benchmarking object extraction from images, such as the COCO or ImageNet datasets. If your dataset is no good you will never get good results. Qualities of a dataset:
        • Leverage: the potential of the dataset, i.e. what it can be used for.
        • Friction to impact: the additional work needed on the dataset to get a significant impact.
        • Cleanness: the percentage of errors in the dataset that may sabotage the learning process.
     • Methodology: the system/algorithm that takes a dataset and creates a model that can then be used to answer a question. For example, a “bias correction” system that estimates national poll results from a sample questionnaire of 500 people, or a recommendation system.
     • Framework: an API or SDK that is used to build (code) methodologies. For example, Google's AI framework, TensorFlow. The framework should help lower the friction to impact.
     • Proof-of-concept: it does not deliver the business impact, but it shows that the methodology is viable. Used for “fail-fast” or as a first milestone in a larger project.
  4. Computer science >> computer engineering. Math & stats: many times done by physicists. You need people who can do the data tagging. It’s a research project! Data science is more than machine learning. The importance of diversity in a data science team: based on the diagram, it is clear that it is very hard to find people who can cover all of the above, especially for a team that is meant to answer questions from a diverse set of domains. Some like offline analysis; some like real-time systems; some are about processes and some are about tools; some are very good in one domain but have zero knowledge about anything else, etc. A good data scientist should have excellent proficiency in at least one side _and_ at least some understanding of the other 2 sides.
  5. Example of how complicated an ML project can be… YOLO (You Only Look Once) is a lightweight real-time object detection system that can detect objects in a video stream. It took 5 years to get to this version.
  6. Google Translate’s Maori dataset is too small, leading to some funny mistakes. Better not train your model on these cat pictures… A statistician would know this, but a computer systems engineer would not. Internal politics: would the engineer get access to the transactions database?
  7. Feedback loop in results => need to understand causality. E.g. testing the size of a 'like' button: clicking 'like' on the big button brings the item to the top of the list for everyone, so it affects the control group's clicks. You must make sure the observational inference matches causality. You need to be able to understand the result.
  8. … Boosting. Consider changing to Yoav’s chart; give examples.
  9. Building an application with a good UX is outside the scope of a data science team. Tay is an AI-based chatbot created by Microsoft and “unleashed” on Twitter in 2016. It soon absorbed what people talking with her said as the truth.
  10. URME Personal Surveillance Identity Prosthetic, by http://www.urmesurveillance.com/ Kerckhoffs-Shannon principle: “one ought to design systems under the assumption that the enemy will immediately gain full familiarity with them”. Don’t rely on the privacy of the model, because one day or another it will be leaked. You should not base your code entirely on open-source algorithms. You should not base your model on open datasets.
  11. Generative Adversarial Networks (GANs) are sometimes used to fool the original network. In the example: projected gradient descent. Synthesizing adversarial examples for neural networks is surprisingly easy: small, carefully crafted perturbations to inputs can cause neural networks to misclassify inputs in arbitrarily chosen ways. Given that adversarial examples transfer to the physical world and can be made extremely robust, this is a real security concern.
  12. In the GIF: Tesla Model S adaptive cruise control, 1 second before it crashes into a van parked on the roadside (May 2016).
  13. KPI: Key Performance Indicator
  14. Generative Adversarial Networks (GANs) are sometimes used to fool the original network >> this can be used to understand how the neural network works.
  15. Original map (interactive): http://scikit-learn.org/stable/tutorial/machine_learning_map/
  16. Types of data, 2 axes: Is it qualitative (e.g. a questionnaire) or quantitative (e.g. sales transaction logs)? Is it our data (e.g. logs) or third-party data (e.g. the Wikipedia dataset)?