As a manager, what do you need to know for the data-science project you are leading to succeed?
This presentation walks through the data-science project lifecycle, points out common failures, and gives hints on how to avoid common pitfalls. Examples included.
The target audience is managerial to semi-technical.
2. DATA SCIENCE PROJECTS FAIL…
• Between 70% and 80% of corporate business intelligence projects fail (Gartner)
• 55% of big data projects are never finished (Infochimps)
• Only 13% of organizations achieve full-scale production for their in-house big-data implementations (Qubole)
• And the results…
3. “I HATE THIS JOB!”
Top of the list of developers who said they are looking for a new job*:
• ML specialists – 14.3%
• Data scientists – 13.2%
* 2018 Stack Overflow survey, based on 64,000 developers’ answers
4. DATA SCIENCE APPLICATION LIFECYCLE
Business objective and plan >> Build dataset >> Model data and validate >> Implement application >> Deploy >> Monitor, measure & optimize
We’ll look at some common failures in each step and suggest better approaches.
5. BUSINESS OBJECTIVE FAILURES
• Expecting first-day success
• Expecting no false positives
• Expecting 100% accuracy
• No business value expected
• Expecting that the ML itself would be the product
• Not defining the deliverable
6. CAN YOU AFFORD A FALSE POSITIVE?
• Google “fixed” its “racist” algorithm by removing gorillas from its image-labeling tech
7. BE REALISTIC!
• Very few businesses have AI/ML/data as their core product
• Most use these tools to improve the bottom line of existing products
8. TYPES OF DELIVERABLES
1. Descriptive analysis (offline report)
2. Dashboard (real-time system)
3. Automated decision-making system (“self-driving” system)
4. Dataset with specific qualities (to be used by another ML) – define its leverage, friction to impact, and cleanness
5. Methodology (dataset >> model)
6. Framework (API/SDK used to build methodologies)
7. Proof-of-concept (proving a methodology is viable)
9. PLANNING FAILURES
• Missing diversity in the team
• In many projects, 80% of the work is building the dataset!
• It’s a *research* project!
• Short time to delivery
[Figure: Drew Conway’s Data Science Venn Diagram]
11. DATA INVENTORY FAILURES
• Too little data to build on
• Dataset is dirty
• Missing data from the field
12. DIRTY DATASET: NEGATIVE INFLUENCE
[Figure: a dataset that includes negative-influence examples, and the resulting classification with confidence scores]
13. DATA MODELING FAILURES
You need to be able to understand the result!
• Jumping to conclusions on what the data is
• Assuming it works based on a small sample
• Feedback loop in results
• Missing cross-validation (see the sketch below)
• Choosing algorithms that are too heavy for the application
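A minimal sketch of the cross-validation point, using scikit-learn; the dataset and model choice here are illustrative assumptions, not from the deck:

    # Minimal cross-validation sketch (assumes scikit-learn is installed).
    from sklearn.datasets import load_breast_cancer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    X, y = load_breast_cancer(return_X_y=True)
    model = LogisticRegression(max_iter=5000)

    # 5-fold cross-validation: every sample serves in both training and
    # validation folds, so one lucky train/test split cannot hide overfitting.
    scores = cross_val_score(model, X, y, cv=5)
    print("fold accuracies:", scores)
    print("mean accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))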
14. ML ALGORITHMS [PARTIAL] MAP
(Map levels: application >> class >> algorithms. A usage sketch for two of these follows the map.)
Supervised learning
• Classification: Linear classifiers / Fisher’s discriminant, Support vector machines / Least squares, Quadratic classifiers, Kernel estimation, K-nearest neighbor, Decision trees, Random forests, Neural networks, Learning vector quantization, Boosting
• Regression: Linear regression, Logistic regression, CART, Naïve Bayes
• Ensemble: Bagging with Random Forests, Boosting with XGBoost
Unsupervised learning
• Association: Apriori
• Clustering: K-means, Mean-shift, Density-based spatial (DBSCAN), EM-GMM, Agglomerative hierarchical
• Dimensionality reduction – feature selection: Variance thresholds, Correlation thresholds, Genetic algorithms (GA), Stepwise search
• Dimensionality reduction – feature extraction: PCA, Linear discriminant analysis (LDA), Autoencoders
Reinforcement learning
• Exploration: Criterion of optimality, Brute force, Value function, Direct policy search
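To make the map concrete, here is a small sketch showing how two of the mapped algorithms (K-means clustering and PCA) might be invoked via scikit-learn; the synthetic data and parameters are invented for illustration:

    # Illustrative sketch of two algorithms from the map (scikit-learn).
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.decomposition import PCA

    rng = np.random.RandomState(0)
    X = np.vstack([rng.normal(0, 1, (100, 5)),
                   rng.normal(4, 1, (100, 5))])  # two synthetic blobs

    # Unsupervised learning >> Clustering >> K-means
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

    # Unsupervised learning >> Dimensionality reduction >> Feature extraction >> PCA
    X2 = PCA(n_components=2).fit_transform(X)

    print("cluster sizes:", np.bincount(labels))
    print("reduced shape:", X2.shape)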
15. APPLICATION FAILURES
• Asking the data-science team to build the application…
• Not testing at scale
• Switching from monitoring to automatic action-taking too fast
• Missing safeguards on output (see the sketch below)
• Not preparing for attack!
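One possible reading of “safeguards on output”, sketched below: never let a raw prediction trigger an automatic action unchecked. The wrapper, thresholds, and sklearn-style predict_proba interface are all illustrative assumptions, not the presenter’s implementation:

    # Hypothetical safeguard wrapper around a model that drives automatic actions.
    # Model interface, thresholds, and fallback actions are illustrative.
    def safe_decision(model, features, min_confidence=0.9,
                      sane_range=(0.0, 10000.0)):
        proba = model.predict_proba([features])[0]
        label, confidence = proba.argmax(), proba.max()

        # Safeguard 1: below the confidence bar, defer to a human instead of acting.
        if confidence < min_confidence:
            return ("escalate_to_human", confidence)

        # Safeguard 2: reject out-of-range inputs (possible attack, or a broken
        # upstream pipeline) rather than acting on garbage.
        if not all(sane_range[0] <= f <= sane_range[1] for f in features):
            return ("reject_input", confidence)

        return (label, confidence)

In production you would also log every escalation and rejection, since those streams are exactly where attacks and data drift first show up.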
16. DIRECT ATTACK EXAMPLE
17. SYNTHESIZED ADVERSARIAL EXAMPLE
18. “WE HAVEN’T SEEN ANYTHING LIKE THIS BEFORE…”
19. MONITOR > MEASURE > OPTIMIZE FAILURES
• Assuming it just works…
  • Not having a long enough beta
  • Missing feedback from real users
• Missing KPIs
  • Measure business success
  • Find false positives
• Missing built-in A/B testing (see the sketch below)
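A minimal sketch of the kind of A/B check implied here, using a standard two-proportion z-test; the conversion numbers are made up:

    # Minimal A/B-test check: is variant B's conversion rate significantly
    # better than A's? Two-proportion z-test; numbers are illustrative.
    from math import sqrt
    from statistics import NormalDist

    conv_a, n_a = 200, 5000   # control: 4.0% conversion
    conv_b, n_b = 260, 5000   # variant: 5.2% conversion

    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided

    print("z = %.2f, p-value = %.4f" % (z, p_value))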
20. "Right now, a lot of our AI
systems make decisions in
ways that people don't really
understand… And I don't
think that… we want to end
up with systems that people
don't understand how they're
making decisions.“
• ZUCKERBERG at Senate
hearing 10-Apr-18
9/3/2018 Why So Many Data Science Projects Fail 20
SPEAKER NOTES
Expect magic to happen!
YOLO (You Only Look Once) is a lightweight real-time object detector – it can detect objects in a video stream. It took 5 years to get to this version…
Dan Ariely: Big data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it...
Precision vs. accuracy; false positives vs. false negatives…
Think of automated cancer prediction – it can reduce the false negatives of a human professional, such as a radiologist.
Outputs of data science:
Descriptive analysis (a report): a clear answer to a clear question, e.g. which platform the new product should launch on: Android or iOS. This is usually an offline system.
Dashboard: helps a human decide and take action continuously, or again and again. This is usually an online/real-time system.
Automated decision making: based on the dashboard, take automatic action (a “self-driving” system).
Dataset: data that is then used by another algorithm. For example, a cleaned-up list of addresses that users entered on a form, or a dataset used for training and benchmarking object extraction from images (the COCO or ImageNET datasets). If your dataset is no good, you will never get good results.
Qualities of a dataset (a measurement sketch follows below):
Leverage: the potential of the dataset – what it can be used for.
Friction to impact: the additional work needed on the dataset to get a significant impact.
Cleanness: the percentage of errors in the dataset that may sabotage the learning process.
Methodology: the system/algorithm used to take a dataset and create a model that can then answer a question. For example, estimating national poll results from a sample questionnaire of 500 people >> a “bias correction” system; or a recommendation system.
Framework: an API or SDK used to build (code) methodologies, e.g. Google’s AI framework, TensorFlow. The framework should help lower the friction to impact.
Proof-of-concept: it does not deliver the business impact itself, but shows that the methodology is viable. Used for “fail-fast” or as a first milestone in a larger project.
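A sketch of how “cleanness” might be measured in practice, following the cleaned-up-addresses example; the file name and validity rules below are invented for illustration (pandas assumed):

    # Sketch: measuring "cleanness" as the share of rows usable for learning.
    # "addresses.csv" and the validity rules are hypothetical.
    import pandas as pd

    df = pd.read_csv("addresses.csv")

    bad = (
        df["address"].isna()                   # missing values
        | df.duplicated(subset=["address"])    # duplicate entries
        | (df["address"].str.len() < 5)        # implausibly short entries
    )
    cleanness = 1.0 - bad.mean()
    print("cleanness: %.1f%% of rows usable" % (100 * cleanness))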
Computer science >> computer engineering.
Math & stats – many times done by physicists.
Need people who can do the data tagging.
It’s a research project!
Data science is more than machine learning.
The importance of diversity in a data-science team: the diagram makes it clear that it is very hard to find people who can cover all of the above, especially in a team meant to answer questions from a diverse set of domains.
Some like offline analysis; some like real-time systems. Some are about processes, some are about tools. Some are very good in one domain but have zero knowledge of anything else… etc.
A good data scientist should have excellent proficiency in at least one side of the diagram _and_ at least some understanding of the other two.
Example of how complicated an ML project can be: YOLO, mentioned above, took 5 years to get to this version…
Google Translate’s Maori dataset is too small, leading to some funny mistakes.
Better not train your model on these cat pictures…
A statistician would know this… but a computer-systems engineer would not.
Internal politics – would the engineer get access to the transactions database?
Feedback loop in results => you need to understand causality. E.g. testing a ‘like’ button’s size: clicking ‘like’ on the big button brings the item to the top of the list for everyone, so it affects the control group’s clicks as well (see the toy simulation below). You must make sure the observational inference matches causality. You need to be able to understand the result.
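A toy simulation of that feedback loop (all numbers invented): the treatment group’s clicks feed a ranking that both groups share, so the control group’s clicks are inflated too, and a naive comparison misreads the causal effect of the button size:

    # Toy simulation of a feedback loop through a shared ranking.
    import random

    random.seed(0)
    popularity = [0] * 20            # shared ranking state seen by BOTH groups

    def run_group(big_button, n_users=2000):
        clicks = 0
        for _ in range(n_users):
            # Users gravitate to the currently top-ranked item.
            item = max(range(20), key=lambda i: (popularity[i], random.random()))
            # Click probability: button-size effect + herd effect from popularity.
            p = (0.15 if big_button else 0.10) + min(0.002 * popularity[item], 0.2)
            if random.random() < p:
                popularity[item] += 1       # the click feeds the SHARED ranking
                clicks += 1
        return clicks

    treatment = run_group(big_button=True)    # boosts items in the shared ranking
    control = run_group(big_button=False)     # inherits that boost: inflated clicks

    print("treatment clicks:", treatment)
    print("control clicks:", control)   # higher than a clean control would be,
                                        # so the measured lift understates the truth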
Consider changing to Yoav’s chart – give examples.
Building an application with a good UX is outside the scope of a data-science team.
Tay is an AI-based chatbot created by Microsoft and “unleashed” on Twitter in 2016. It soon absorbed what the people talking to it said as the truth…
URME Personal Surveillance Identity Prosthetic – by http://www.urmesurveillance.com/
Kerckhoffs-Shannon principle: “one ought to design systems under the assumption that the enemy will immediately gain full familiarity with them”.
Don’t rely on the secrecy of the model, because sooner or later it will leak.
You should not base your code entirely on open-source algorithms, nor your model entirely on open datasets.
Generative Adversarial Networks (GANs) are sometimes used to fool the original network. In the example: projected gradient descent (a simplified sketch follows below).
Synthesizing adversarial examples for neural networks is surprisingly easy: small, carefully-crafted perturbations to inputs can cause neural networks to misclassify them in arbitrarily chosen ways. Given that adversarial examples transfer to the physical world and can be made extremely robust, this is a real security concern.
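A simplified sketch of this family of attacks, using the fast gradient sign method (a one-step relative of the projected gradient descent mentioned above); the PyTorch model and “image” below are untrained stand-ins, not a real attacked system:

    # Fast-gradient-sign sketch: nudge the input in the direction that
    # INCREASES the loss, producing an adversarial example.
    import torch
    import torch.nn.functional as F

    model = torch.nn.Sequential(                 # stand-in classifier (untrained)
        torch.nn.Flatten(), torch.nn.Linear(28 * 28, 10))
    model.eval()

    x = torch.rand(1, 1, 28, 28, requires_grad=True)   # stand-in "image"
    true_label = torch.tensor([3])

    loss = F.cross_entropy(model(x), true_label)
    loss.backward()

    # A perturbation a human would barely notice can flip the predicted class.
    epsilon = 0.1
    x_adv = (x + epsilon * x.grad.sign()).clamp(0, 1).detach()

    print("original prediction:", model(x).argmax().item())
    print("adversarial prediction:", model(x_adv).argmax().item())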
In the GIF: a Tesla Model S on adaptive cruise control, one second before crashing into a van parked on the roadside – May 2016.
KPI: Key Performance Indicator
Generative Adversarial Networks (GANs) are sometimes used to fool the original network >> they can also be used to understand how the neural network works.
Original map (interactive): http://scikit-learn.org/stable/tutorial/machine_learning_map/
Types of data – two axes:
Is it qualitative (e.g. a questionnaire) or quantitative (e.g. sales transaction logs)?
Is it our data (e.g. logs) or 3rd-party data (e.g. a Wikipedia dataset)?