As a manager, what do you need to know for the data-science project you are leading to succeed?
This presentation walks through the data-science project lifecycle, points out common failures, and gives hints on how to avoid common pitfalls. Examples included.
The target audience is managerial to semi-technical.
2. DATA SCIENCE PROJECTS FAIL…
• Between 70% and 80% of corporate business intelligence projects fail (Gartner)
• 55% of big data projects are never finished (Infochimps)
• Only 13% of organizations achieve full-scale production for their in-house big-data implementations (Qubole)
• And the results…
3. “I HATE THIS JOB!”
Top of the list of developers who said they are looking for a new job*:
• ML specialists – 14.3%
• Data scientists – 13.2%
* 2018 Stack Overflow survey, based on 64,000 developers’ answers
4. DATA SCIENCE APPLICATION LIFECYCLE
Business objective and plan >> Build dataset >> Model data and validate >> Implement application >> Deploy >> Monitor, measure & optimize
We’ll look at some common failures in each step and suggest better approaches.
5. BUSINESS OBJECTIVE FAILURES
• Expecting first-day success
• Expecting no false positives
• Expecting 100% accuracy
• No business value expected
• Expecting that the ML itself would be the product
• Not defining the deliverable
6. CAN YOU AFFORD A FALSE POSITIVE?
• Google “fixed” its “racist” algorithm by removing gorillas from its image-labeling tech
7. BE REALISTIC!
• Very few businesses have AI/ML/data as their core product
• Most use these tools to improve the bottom line of existing products
8. TYPES OF DELIVERABLES
1. Descriptive analysis (offline report)
2. Dashboard (real-time system)
3. Automated decision-making system (“self-driving” system)
4. Dataset with specific qualities (to be used by another ML) – define its leverage, friction to impact, and cleanness
5. Methodology (dataset >> model)
6. Framework (API/SDK used to build methodologies)
7. Proof-of-concept (proving a methodology is viable)
9. PLANNING FAILURES
• Missing diversity in the team
• In many projects, 80% of the work is building the dataset!
• It’s a *research* project!
• Short time to delivery
[Figure: Drew Conway’s Data Science Venn Diagram]
11. DATA INVENTORY FAILURES
• Too little data to build on
• Dataset is dirty
• Missing data from the field
12. DIRTY DATASET: NEGATIVE INFLUENCE
[Figure: a dataset that includes negative-influence examples, and the resulting classification with confidence scores]
13. DATA MODELING FAILURES
You need to be able to understand the result!
• Jumping to conclusions on what the data is
• Assuming it works based on a small sample
• Feedback loop in results
• Missing cross-validation (see the sketch below)
• Choosing algorithms that are too heavy for the application
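A minimal sketch of the cross-validation point, using scikit-learn; the dataset and model choice here are illustrative assumptions, not from the deck:

    # Minimal cross-validation sketch (assumes scikit-learn is installed).
    from sklearn.datasets import load_breast_cancer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    X, y = load_breast_cancer(return_X_y=True)
    model = LogisticRegression(max_iter=5000)

    # 5-fold cross-validation: every sample serves in both training and
    # validation folds, so one lucky train/test split cannot hide overfitting.
    scores = cross_val_score(model, X, y, cv=5)
    print("fold accuracies:", scores)
    print("mean accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))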
14. ML ALGORITHMS [PARTIAL] MAP
(Map levels: application >> class >> algorithms. A usage sketch for two of these follows the map.)
Supervised learning
• Classification: Linear classifiers / Fisher’s discriminant, Support vector machines / Least squares, Quadratic classifiers, Kernel estimation, K-nearest neighbor, Decision trees, Random forests, Neural networks, Learning vector quantization, Boosting
• Regression: Linear regression, Logistic regression, CART, Naïve Bayes
• Ensemble: Bagging with Random Forests, Boosting with XGBoost
Unsupervised learning
• Association: Apriori
• Clustering: K-means, Mean-shift, Density-based spatial (DBSCAN), EM-GMM, Agglomerative hierarchical
• Dimensionality reduction – feature selection: Variance thresholds, Correlation thresholds, Genetic algorithms (GA), Stepwise search
• Dimensionality reduction – feature extraction: PCA, Linear discriminant analysis (LDA), Autoencoders
Reinforcement learning
• Exploration: Criterion of optimality, Brute force, Value function, Direct policy search
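To make the map concrete, here is a small sketch showing how two of the mapped algorithms (K-means clustering and PCA) might be invoked via scikit-learn; the synthetic data and parameters are invented for illustration:

    # Illustrative sketch of two algorithms from the map (scikit-learn).
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.decomposition import PCA

    rng = np.random.RandomState(0)
    X = np.vstack([rng.normal(0, 1, (100, 5)),
                   rng.normal(4, 1, (100, 5))])  # two synthetic blobs

    # Unsupervised learning >> Clustering >> K-means
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

    # Unsupervised learning >> Dimensionality reduction >> Feature extraction >> PCA
    X2 = PCA(n_components=2).fit_transform(X)

    print("cluster sizes:", np.bincount(labels))
    print("reduced shape:", X2.shape)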
15. APPLICATION FAILURES
• Asking the data-science team to build the application…
• Not testing at scale
• Switching from monitoring to automatic action-taking too fast
• Missing safeguards on output (see the sketch below)
• Not preparing for attack!
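One possible reading of “safeguards on output”, sketched below: never let a raw prediction trigger an automatic action unchecked. The wrapper, thresholds, and sklearn-style predict_proba interface are all illustrative assumptions, not the presenter’s implementation:

    # Hypothetical safeguard wrapper around a model that drives automatic actions.
    # Model interface, thresholds, and fallback actions are illustrative.
    def safe_decision(model, features, min_confidence=0.9,
                      sane_range=(0.0, 10000.0)):
        proba = model.predict_proba([features])[0]
        label, confidence = proba.argmax(), proba.max()

        # Safeguard 1: below the confidence bar, defer to a human instead of acting.
        if confidence < min_confidence:
            return ("escalate_to_human", confidence)

        # Safeguard 2: reject out-of-range inputs (possible attack, or a broken
        # upstream pipeline) rather than acting on garbage.
        if not all(sane_range[0] <= f <= sane_range[1] for f in features):
            return ("reject_input", confidence)

        return (label, confidence)

In production you would also log every escalation and rejection, since those streams are exactly where attacks and data drift first show up.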
16. DIRECT ATTACK EXAMPLE
17. SYNTHESIZED ADVERSARIAL EXAMPLE
18. “WE HAVEN’T SEEN ANYTHING LIKE THIS BEFORE…”
19. MONITOR > MEASURE > OPTIMIZE FAILURES
• Assuming it just works…
  • Not having a long enough beta
  • Missing feedback from real users
• Missing KPIs
  • Measure business success
  • Find false positives
• Missing built-in A/B testing (see the sketch below)
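A minimal sketch of the kind of A/B check implied here, using a standard two-proportion z-test; the conversion numbers are made up:

    # Minimal A/B-test check: is variant B's conversion rate significantly
    # better than A's? Two-proportion z-test; numbers are illustrative.
    from math import sqrt
    from statistics import NormalDist

    conv_a, n_a = 200, 5000   # control: 4.0% conversion
    conv_b, n_b = 260, 5000   # variant: 5.2% conversion

    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided

    print("z = %.2f, p-value = %.4f" % (z, p_value))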
20. "Right now, a lot of our AI
systems make decisions in
ways that people don't really
understand… And I don't
think that… we want to end
up with systems that people
don't understand how they're
making decisions.“
• ZUCKERBERG at Senate
hearing 10-Apr-18
9/3/2018 Why So Many Data Science Projects Fail 20
SPEAKER NOTES
Expect magic to happen!
YOLO (You Only Look Once) is a lightweight real-time object detector – it can detect objects in a video stream. It took 5 years to get to this version…
Dan Ariely: Big data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it...
Precision vs. accuracy; false positives vs. false negatives…
Think of automated cancer prediction – it can reduce the false negatives of a human professional, such as a radiologist.
Outputs of data science:
Descriptive analysis (a report): a clear answer to a clear question, e.g. which platform the new product should launch on: Android or iOS. This is usually an offline system.
Dashboard: helps a human decide and take action continuously, or again and again. This is usually an online/real-time system.
Automated decision making: based on the dashboard, take automatic action (a “self-driving” system).
Dataset: data that is then used by another algorithm. For example, a cleaned-up list of addresses that users entered on a form, or a dataset used for training and benchmarking object extraction from images (the COCO or ImageNET datasets). If your dataset is no good, you will never get good results.
Qualities of a dataset (a measurement sketch follows below):
Leverage: the potential of the dataset – what it can be used for.
Friction to impact: the additional work needed on the dataset to get a significant impact.
Cleanness: the percentage of errors in the dataset that may sabotage the learning process.
Methodology: the system/algorithm used to take a dataset and create a model that can then answer a question. For example, estimating national poll results from a sample questionnaire of 500 people >> a “bias correction” system; or a recommendation system.
Framework: an API or SDK used to build (code) methodologies, e.g. Google’s AI framework, TensorFlow. The framework should help lower the friction to impact.
Proof-of-concept: it does not deliver the business impact itself, but shows that the methodology is viable. Used for “fail-fast” or as a first milestone in a larger project.
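A sketch of how “cleanness” might be measured in practice, following the cleaned-up-addresses example; the file name and validity rules below are invented for illustration (pandas assumed):

    # Sketch: measuring "cleanness" as the share of rows usable for learning.
    # "addresses.csv" and the validity rules are hypothetical.
    import pandas as pd

    df = pd.read_csv("addresses.csv")

    bad = (
        df["address"].isna()                   # missing values
        | df.duplicated(subset=["address"])    # duplicate entries
        | (df["address"].str.len() < 5)        # implausibly short entries
    )
    cleanness = 1.0 - bad.mean()
    print("cleanness: %.1f%% of rows usable" % (100 * cleanness))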
Computer science >> computer engineering.
Math & stats – many times done by physicists.
Need people who can do the data tagging.
It’s a research project!
Data science is more than machine learning.
The importance of diversity in a data-science team: the diagram makes it clear that it is very hard to find people who can cover all of the above, especially in a team meant to answer questions from a diverse set of domains.
Some like offline analysis; some like real-time systems. Some are about processes, some are about tools. Some are very good in one domain but have zero knowledge of anything else… etc.
A good data scientist should have excellent proficiency in at least one side of the diagram _and_ at least some understanding of the other two.
Example of how complicated an ML project can be: YOLO, mentioned above, took 5 years to get to this version…
Google Translate’s Maori dataset is too small, leading to some funny mistakes.
Better not train your model on these cat pictures…
A statistician would know this… but a computer-systems engineer would not.
Internal politics – would the engineer get access to the transactions database?
Feedback loop in results => you need to understand causality. E.g. testing a ‘like’ button’s size: clicking ‘like’ on the big button brings the item to the top of the list for everyone, so it affects the control group’s clicks as well (see the toy simulation below). You must make sure the observational inference matches causality. You need to be able to understand the result.
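A toy simulation of that feedback loop (all numbers invented): the treatment group’s clicks feed a ranking that both groups share, so the control group’s clicks are inflated too, and a naive comparison misreads the causal effect of the button size:

    # Toy simulation of a feedback loop through a shared ranking.
    import random

    random.seed(0)
    popularity = [0] * 20            # shared ranking state seen by BOTH groups

    def run_group(big_button, n_users=2000):
        clicks = 0
        for _ in range(n_users):
            # Users gravitate to the currently top-ranked item.
            item = max(range(20), key=lambda i: (popularity[i], random.random()))
            # Click probability: button-size effect + herd effect from popularity.
            p = (0.15 if big_button else 0.10) + min(0.002 * popularity[item], 0.2)
            if random.random() < p:
                popularity[item] += 1       # the click feeds the SHARED ranking
                clicks += 1
        return clicks

    treatment = run_group(big_button=True)    # boosts items in the shared ranking
    control = run_group(big_button=False)     # inherits that boost: inflated clicks

    print("treatment clicks:", treatment)
    print("control clicks:", control)   # higher than a clean control would be,
                                        # so the measured lift understates the truth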
Consider changing to Yoav’s chart – give examples.
Building an application with a good UX is outside the scope of a data-science team.
Tay is an AI-based chatbot created by Microsoft and “unleashed” on Twitter in 2016. It soon absorbed what the people talking to it said as the truth…
URME Personal Surveillance Identity Prosthetic – by http://www.urmesurveillance.com/
Kerckhoffs-Shannon principle: “one ought to design systems under the assumption that the enemy will immediately gain full familiarity with them”.
Don’t rely on the secrecy of the model, because sooner or later it will leak.
You should not base your code entirely on open-source algorithms, nor your model entirely on open datasets.
Generative Adversarial Networks (GANs) are sometimes used to fool the original network. In the example: projected gradient descent (a simplified sketch follows below).
Synthesizing adversarial examples for neural networks is surprisingly easy: small, carefully-crafted perturbations to inputs can cause neural networks to misclassify them in arbitrarily chosen ways. Given that adversarial examples transfer to the physical world and can be made extremely robust, this is a real security concern.
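A simplified sketch of this family of attacks, using the fast gradient sign method (a one-step relative of the projected gradient descent mentioned above); the PyTorch model and “image” below are untrained stand-ins, not a real attacked system:

    # Fast-gradient-sign sketch: nudge the input in the direction that
    # INCREASES the loss, producing an adversarial example.
    import torch
    import torch.nn.functional as F

    model = torch.nn.Sequential(                 # stand-in classifier (untrained)
        torch.nn.Flatten(), torch.nn.Linear(28 * 28, 10))
    model.eval()

    x = torch.rand(1, 1, 28, 28, requires_grad=True)   # stand-in "image"
    true_label = torch.tensor([3])

    loss = F.cross_entropy(model(x), true_label)
    loss.backward()

    # A perturbation a human would barely notice can flip the predicted class.
    epsilon = 0.1
    x_adv = (x + epsilon * x.grad.sign()).clamp(0, 1).detach()

    print("original prediction:", model(x).argmax().item())
    print("adversarial prediction:", model(x_adv).argmax().item())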
In the GIF: a Tesla Model S on adaptive cruise control, one second before crashing into a van parked on the roadside – May 2016.
KPI: Key Performance Indicator
Generative Adversarial Networks (GANs) are sometimes used to fool the original network >> they can also be used to understand how the neural network works.
Original map (interactive): http://scikit-learn.org/stable/tutorial/machine_learning_map/
Types of data – two axes:
Is it qualitative (e.g. a questionnaire) or quantitative (e.g. sales transaction logs)?
Is it our data (e.g. logs) or 3rd-party data (e.g. a Wikipedia dataset)?