The document discusses why 87% of data science projects fail to make it into production. It identifies three main reasons for failure: data is inaccurate, siloed and slow; there is a lack of business readiness; and operationalization is unreachable. To address these issues, the document recommends establishing data governance, defining an organizational data science strategy and use cases, ensuring the technology stack is updated, and having data scientists collaborate with data engineers. It also provides tips for successful data science projects, such as having short timelines, small focused teams, and prioritizing business problems over solutions.
3. AGENDA
1. STATE OF DATA SCIENCE OVERVIEW
2. WHY DATA SCIENCE PROJECTS FAIL
3. PROJECT DO'S AND DON'TS
4. Data science literacy is growing across business disciplines and is becoming critical for nearly all enterprise job titles
(Chart: Data Science Adoption Across Roles)
5. WHY DATA SCIENCE PROJECTS FAIL
87% of data science projects never make it into production.

DATA IS INACCURATE, SILOED, AND SLOW
Successful data science initiatives rely on aligning data quality, master data management, and data governance throughout your organization to ensure they are fully integrated and working together.

LACK OF BUSINESS READINESS
Establish a clear and honest understanding of the requirements and capabilities needed to take on data science initiatives. While investing in technology and people, also conduct thorough due diligence to define achievable use cases.

OPERATIONALIZATION IS UNREACHABLE
Set yourself up for success by investing in business modernization. Make sure your technology stack is up to date, data pipelines and processes are scalable, and data scientists and engineers collaborate.
6. WHY DATA SCIENCE PROJECTS FAIL
7. Data is Inaccurate, Siloed, and Slow
CLEAN WATER: A highly defined process with multiple steps is needed to create, monitor, and deliver clean water.
CLEAN DATA: Delivery of clean data generally lacks the required level of rigor and investment in processes, technologies, and resources.
8. How do we get clean data that is available across the organization?
• A process that begins with Data Governance (DG), incorporates Data Quality (DQ), and finally leverages Master Data Management (MDM)
• Most companies focus on only one or some of these efforts without coupling them together
9. Data Governance
Data Governance is the exercise of authority and control (planning, monitoring, and enforcement) over the management of data assets.
11. Data Quality Across 6 Key Dimensions

Key Contributors of Data Quality Issues
1. Source System Issues. Sub-optimal system configuration and fields not being used for their intended purposes.
2. Data Input Errors. Freeform fields may be left blank or populated with incorrect data; fields may not be populated at all, or not at the right time.
3. Proliferation of Redundant Data. With limited availability of certified data, different teams source their own data, leading to multiple copies.
4. Inconsistent Usage. Without a defined set of enterprise-wide metrics, data is often defined and used in varied ways (e.g., different KPIs, different source sets of data).
5. Lack of Data Auditing. Little to no visibility into actual data quality, and no enforcement to improve it.
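These contributors can be caught early with lightweight automated profiling. A minimal sketch in plain Python follows; the records, field names, and rules are hypothetical examples, not from the deck:

```python
# Minimal data-quality profiling sketch: checks completeness (blank
# required field), validity (age rule), and uniqueness (redundant copies).

records = [
    {"id": 1, "name": "Gabby Lio", "age": 31},
    {"id": 2, "name": "",          "age": 200},  # blank field, invalid age
    {"id": 1, "name": "Gabby Lio", "age": 31},   # redundant copy
]

def profile(rows):
    issues = []
    seen = set()
    for i, r in enumerate(rows):
        if not r["name"]:                 # completeness: required field blank
            issues.append((i, "missing name"))
        if not 0 <= r["age"] <= 120:      # validity: no 200-year-old customers
            issues.append((i, "invalid age"))
        key = (r["id"], r["name"], r["age"])
        if key in seen:                   # uniqueness: duplicate record
            issues.append((i, "duplicate record"))
        seen.add(key)
    return issues

print(profile(records))  # each tuple is (row index, issue found)
```

Checks like these are cheap to run on every load, which is exactly the kind of auditing the fifth contributor says is usually missing.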
12. Master Data Management
Master Data Management is a technology-driven discipline that allows companies to accurately combine data from multiple data sources; it is used to create the master definition for data domains and to drive consistent use of high-integrity data across the company.
• While DQ can be considered a separate discipline, many MDM technology providers today include DQ within their MDM technology offering
• DQ and MDM can only be successful when operating under a well-implemented Data Governance program

Example: an ERP system, a CRM system, and a claims system each hold a record for the same customer, and rules are applied to determine the golden record to ensure alignment around common use of data:
Gabby Lio | 1709 Tree Drive | Austin TX 78745 | 10-31-1990
Gaby Lio | 1907 Steele Ct. | Austin TX 78789 | 10-31-1990
Gabriella Lio | 1709 Tree Drive | Austin TX 78745 | 10-30-1990
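The golden-record step can be sketched in a few lines of Python over the three records above. The survivorship rules here (most frequent value per field, with source priority as the tie-break) are illustrative assumptions; real MDM tools let you configure such rules per data domain:

```python
# Golden-record sketch: rule-based survivorship over three source records.
# Priority order (1 beats 3) and the majority-vote rule are assumptions.

sources = [  # (priority, record); lower priority number wins ties
    (1, {"name": "Gabby Lio",     "address": "1709 Tree Drive, Austin TX 78745", "dob": "10-31-1990"}),
    (2, {"name": "Gaby Lio",      "address": "1907 Steele Ct., Austin TX 78789", "dob": "10-31-1990"}),
    (3, {"name": "Gabriella Lio", "address": "1709 Tree Drive, Austin TX 78745", "dob": "10-30-1990"}),
]

def golden_record(recs):
    merged = {}
    for field in recs[0][1]:
        # tally each candidate value and the best (lowest) priority seen for it
        best = {}
        for prio, r in recs:
            cnt, p = best.get(r[field], (0, prio))
            best[r[field]] = (cnt + 1, min(p, prio))
        # survivorship rule: most frequent value; ties broken by source priority
        merged[field] = max(best, key=lambda v: (best[v][0], -best[v][1]))
    return merged

print(golden_record(sources))
```

Here the majority agrees on the address and date of birth, and the priority tie-break settles the three-way disagreement on the name.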
13. Data Governance in the Age of AI

SAVES TIME
• When building a predictive model, data scientists spend most of their time cleaning and identifying data to use
• Profiling the data

GARBAGE IN, GARBAGE OUT
• The worse the quality of the data you train with, the worse the result of the AI
• AI projects shouldn't be started until you know you have good data
• Good data in, great decisions out

ETHICAL AI
• Privacy: AI systems must comply with privacy laws that require transparency about the collection, use, and storage of data
• Fairness: Minimizing bias in our data
15. WHY DATA SCIENCE PROJECTS FAIL
16. Lack of Business Readiness
• Organizations often lack the necessary analytic team structure to:
  1. Best enable a data-driven culture
  2. Realize the full potential, and ROI, of analytical capabilities
• Companies rarely lack data, tools, or technologies; this is more of a people and process issue
• Purposefully choosing an organizational strategy is one of the first and foremost decisions an analytics leader can make
(Diagram: People, Process, Technology)
17. Organizational Data Science Strategies

Decentralized
Benefits
• Subject matter expertise quickly available/accessible
• Analytics functions and teams are closely aligned to business, issues, and customers
Challenges
• Redundancy in physical resources and talent
• Inconsistency in process, results, and tools
• Focus on local issues
• No standardization and not leveraging scale

Semi-centralized
Benefits
• Shared services, processes, tools, and methodologies
• On-demand provisioning and better cost control
• Continuous improvement is likely, as efforts are focused on iteratively improving a core business
Challenges
• Less transparent allocation of resources among different initiatives
• Tends to bias certain business units
• Difficulty in cross-functional alignment and consensus

Centralized
Benefits
• Shared services, processes, tools, and methodologies
• On-demand provisioning and better cost control
• Best positioned for long-term innovation and value by being removed from the day-to-day fires of business units
Challenges
• Requires CXO-level commitment and investment to empower fast and effective organizational adoption
• Business and subject matter expertise requires more effort, engagement, and evangelism to attain
18. Defining Achievable Use Cases in 3 Steps

List out potential use cases
• A question that can be answered using data
• You may be looking for an answer, an explanation, or just validation
• Steer away from bias towards things only YOU know about, and from writing off things people think are too hard or impossible

Evaluate each use case
• Level of effort / technical feasibility
• Business value

Prioritize use cases
• Low level of effort / high technical feasibility coupled with high business value is a good place to start
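The evaluate-and-prioritize steps above reduce to a simple scoring pass. The use cases and scores below are hypothetical; the point is the quadrant logic, where high business value coupled with high technical feasibility comes first:

```python
# Use-case prioritization sketch: a code version of the value vs.
# feasibility grid. Names, scores, and thresholds are made-up examples.

use_cases = [
    # (name, business_value 1-10, technical_feasibility 1-10)
    ("Churn early-warning model",       9, 8),
    ("Fully autonomous pricing engine", 9, 2),
    ("Report automation",               4, 9),
]

def prioritize(cases, min_value=7, min_feasibility=7):
    # quick wins: the top-right quadrant (high value AND high feasibility)
    quick_wins = [c for c in cases if c[1] >= min_value and c[2] >= min_feasibility]
    # everything else is ordered by combined score for later iterations
    rest = sorted((c for c in cases if c not in quick_wins),
                  key=lambda c: c[1] + c[2], reverse=True)
    return quick_wins, rest

wins, later = prioritize(use_cases)
print([c[0] for c in wins])   # tackle these first for an early success
```

Re-running the same pass after each delivery cycle matches the iterative re-prioritization the talk recommends.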
20. WHY DATA SCIENCE PROJECTS FAIL
21. Building vs. Scaling Machine Learning
COMMON TOOLS USED. Building: Scikit-Learn, Pandas, Jupyter, local environment. Scaling: MLflow, MLlib, Spark, IDEs, DVC, cloud environment.
MODEL TRAINING AND PREDICTION. Building: managed by data scientists. Scaling: automatically orchestrated.
DEPLOYED. Building: not deployed. Scaling: deployed in production.
MODEL VALIDATION. Building: manual. Scaling: automated.
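The manual vs. automated model-validation row can be illustrated with a minimal promotion gate. This is a hedged sketch in plain Python with stand-in callables rather than real models; a scaled system would run a gate like this automatically after every training run:

```python
# Automated model-validation sketch: deploy a newly trained model only
# if it beats the production model on a holdout set. The "models" here
# are toy threshold functions, purely for illustration.

holdout = [(x, x > 5) for x in range(10)]   # (feature, label) pairs

current_model = lambda x: x > 6             # model now in production
candidate_model = lambda x: x > 5           # newly trained candidate

def accuracy(model, data):
    return sum(model(x) == y for x, y in data) / len(data)

def promote(candidate, current, data, margin=0.0):
    # automated gate: promote only on a measurable improvement
    return accuracy(candidate, data) > accuracy(current, data) + margin

print(promote(candidate_model, current_model, holdout))
```

The same comparison a data scientist would eyeball in a notebook becomes a repeatable, orchestrated check.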
22. What do we need to achieve Operationalization?

Storage
• The volume of data is growing
• Need somewhere to put all this data

Robust Data
• Need data from different sources (e.g., CRM, ERP, spreadsheets)
• Across the business (e.g., HR, Finance, Customer)
• Historical
• Readily available

Compute
• High-performing data processing
• Processing power to drive out our analysis

Output
• Communicating findings
• Graphs/charts
• Presentations

Model Deployment
• Testing
• Automated deployment
• Ethics in AI: trusted model, fair model

Model Management
• Statistical process control
• Data drift and model drift
• Stale models
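The statistical process control idea under Model Management can be sketched with 3-sigma control limits on a feature's mean. The data and threshold are illustrative; production systems would typically use richer drift tests, but the control-chart logic is the same:

```python
# Data-drift sketch using statistical process control: flag a live batch
# whose mean falls outside 3-sigma limits computed from training data.
import statistics

train = [10.0, 10.2, 9.8, 10.1, 9.9, 10.0, 10.3, 9.7]

def control_limits(values, sigmas=3):
    mu = statistics.mean(values)
    sd = statistics.stdev(values)
    return mu - sigmas * sd, mu + sigmas * sd

def drifted(batch, limits):
    lo, hi = limits
    return not lo <= statistics.mean(batch) <= hi

limits = control_limits(train)
print(drifted([10.1, 9.9, 10.0], limits))   # in control
print(drifted([14.0, 13.5, 14.2], limits))  # drifted: time to retrain
```

A drift alarm like this is what turns "stale models" from a silent failure into a scheduled retraining trigger.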
23. Technology Stack is Up-To-Date

DATA WAREHOUSES
Highly scalable, managed cloud data warehouses enable you to store TBs of data with just a few lines of SQL and no infrastructure. On-demand pricing means the technology is affordable for everyone, with only a few minutes of setup time.
Examples: Amazon Redshift, Google BigQuery, Snowflake, Azure Synapse

DATA PIPELINES
Ensure you have the fuel to power your warehouse and tools; without data, you have nothing to analyze. Especially important when giving real-time predictions and analysis on streaming data.
Examples: Apache Kafka, Apache Airflow, Confluent, Spark, Python, REST APIs

ANALYTICAL TOOLS
You need a framework for the entire life cycle of a data science project. The platform contains all the tools required for executing the lifecycle of the data science project across its different phases.
Examples: Python, R, Apache Spark, Anaconda, Databricks, H2O.ai, Alteryx, Domino

VISUALIZATIONS
In the world of Big Data, data visualization tools and technologies are essential to analyze massive amounts of information and make data-driven decisions.
Examples: Matplotlib, Tableau, Power BI, Plotly, D3, QlikView
24. Collaboration between Data Scientists & Data Engineers
• Data Engineering involves collecting relevant data, then moving and transforming it through "pipelines" for the Data Science team.
• Data Scientists analyze, test, aggregate, and optimize the data, and present it for the company.
• Some companies with advanced processes round out their teams with AI Engineers, Machine Learning Engineers, or Deep Learning Engineers.

It becomes clear that these tasks have to be divided among specific data professionals.
25. Collaboration between Data Scientists & Data Engineers
(Chart: data engineering skills vs. analytical skills, showing where Data Engineers and Data Scientists overlap)
• A data engineering resource can do some basic to intermediate-level analytics but will be hard pressed to do the advanced analytics that a data scientist does.
• Having a data scientist create a data pipeline is at the far edge of their skills but is the bread and butter of a data engineering resource.
• The two roles are complementary, with data engineering resources supporting the work of data scientists; data scientists and data engineering resources overlap on engineering and analysis.
26. 3 Key Takeaways: What do you do when you notice…

Data is inaccurate, siloed, and slow? Implement Data Governance, which will enable Data Quality and Master Data Management.

There is a lack of business readiness? Create an organizational strategy for data science that works for your company, and prioritize use cases iteratively.

Operationalization is unreachable? Realize the difference between building and scaling machine learning models, update your technology stack, and make sure data scientists collaborate with data engineering resources.
27. Survey the Audience
Discovering Project Do's and Don'ts
28. When designing a solution, is your team more focused on…
Designing the 'supreme' solution, or beginning on the solution early, being agile, and starting small?
29. What is the average timeline for deliverables on data science projects you have been a part of?
Timelines that deliver on weekly scales, or timelines that deliver on monthly scales?
30. When engaging in a project, is your team...
Hyper-focused on the business problem, or hyper-focused on the solution?
31. PROJECT DO'S AND DON'TS

DO:
• Begin early, be agile, and start small
• Timelines that deliver on weekly scales
• Aim for "good enough" and adding business value
• 4-6 person teams
• Hyper-focused on the business problem
• Co-developing with SMEs and stakeholders
• Focus on a fast mover strategy

DON'T:
• Designing the 'supreme' solution
• Timelines that deliver on monthly scales
• Aim for perfect accuracy
• Large, slow-moving teams
• Hyper-focused on the solution
• Developing in silos
• Focus on a first mover strategy
32. THE JOURNEY TO AI ADOPTION
(Chart: stages plotted by business readiness vs. technical capability)

Experimentation: Business leaders are exploring the landscape, talking to vendors, etc.
Clean Data: Data is reliable and accurate for deep analysis and modeling.
Established Data Governance: Accountable and consistent standards are implemented.
Proof of Value: Real and measurable prototypes are scoped and built for technical understanding and business value.
Modern Data Architecture: Data is no longer slow or siloed, thanks to next-gen technology stacks and business stakeholder buy-in.
Scalable Machine Learning: Teams, technologies, and techniques are highly efficient at building, deploying, and managing data pipelines across the enterprise.
AI Adoption: AI has been seamlessly integrated into enterprise processes and technologies.
Good afternoon. I want to start by thanking everyone for joining us today. My name is Gaby Lio, and I am a data scientist at Sense Corp. We have worked with multiple Fortune 500 companies, sharing and implementing data-driven solutions, and I have plenty of scar tissue around why data science projects can succeed and why they can also fail, so I'm excited to be speaking with you all today. Let's dive right in.
Before we dive into why data science projects are failing, I want to start by looking at the current state of data science and how rapidly AI adoption is spreading across all industries and roles, to paint a better picture of why it matters that these projects succeed. The Anaconda State of Data Science report, a survey of the Anaconda community, painted an interesting picture of the types of jobs held by data science learners, and the results showed adoption across every role. You can see a revolution is happening, with interest in data science spanning a very broad range of job functions. This signals that these professionals are increasing their data literacy and will be able to adapt to a data-driven business model where machine learning is incorporated into their day-to-day work. They are ready for it, so why isn't this adoption spreading faster and being implemented across every organization today?
The answer is that data science projects are failing at an alarming rate. Depending on who you ask, most industry surveys will cite that nearly 9 out of 10 data science projects fail, and we can attribute this failure to three specific causes.
The first factor revolves around your data. Having your data in silos prevents employees across the organization from accessing a shared source of data, while inaccurate data can lead to inaccurate decision making and eventually a loss in revenue. Furthermore, if the speed at which your data is ingested and made available to you is slow, real-time analytics will never be an option. Therefore, successful data science initiatives rely on aligning data quality, master data management, and data governance to ensure all three are integrated and fully working together to prevent inaccurate, siloed, and slow data.
The second factor is a lack of business readiness. There is often no honest understanding of the requirements and capabilities needed to take on data science initiatives. We will tackle the people and process side of business readiness by touching on how to set up your data science team within your organization and how other teams should interact with data scientists. Then we'll take a deep dive into defining achievable use cases that can be easy wins for you and your team.
The last factor contributing to why data science projects fail is centered around operationalization being unreachable. In order to set your team up for success, your company should be investing in business modernization, specifically around making sure the technology stack is up to date and that data pipelines and processes are scalable. There should also be a clear distinction between roles on the team, with data scientists and data engineers working together to create and push models into production.
I will step through each of these in greater detail, giving you solutions to prevent these common pitfalls.
Let's start by addressing the issue of data.
To better understand why clean data is so important, I am going to be relating clean water to clean data throughout this section.
In our developed world, we take clean water for granted. We simply have to turn the tap on, pour a glass, and drink the water. But it hasn't always been that way, and it wasn't a simple process that got us there.
We developed technologies such as aqueducts, filters, and water treatment facilities to create and deliver clean water, and now it's a standard. So why haven't we created the standard that our data should be clean? We continue to struggle with clean data because many companies lack the required level of rigor and investment in processes, technologies, and resources to deliver it. We know that dirty water can impact people's health, yet we don't easily accept or recognize the impact that dirty data can have on an organization.
So how do we get clean data that is available to all who need it across the organization? It’s a process that begins with Data Governance, incorporates Data Quality, and finally leverages Master Data Management. Most companies only focus on one or some of these efforts without coupling them together.
While water can freely roll downhill, data needs to be transported downstream, and it requires a defined and concerted effort to end up with clean data. Ensuring these three disciplines are aligned organizationally, fully integrated, and working together is the key to success.
Let's start with Data Governance. At Sense Corp we define Data Governance as "the exercise of authority and control (planning, monitoring, and enforcement) over the management of data assets."
The framework you see on the screen here represents the various categories that must be considered in order to make any governance effort successful. (read out all of them)
But probably the best way to understand governance is through a real-life example of something that happened back in 1969 in Cleveland, Ohio.
For decades in the first half of the 20th century, industrial waste and sewage regularly poured into the Cuyahoga (KAI-A-HOGA) River, and residents accepted it as a consequence of the city's prosperity. But in the 1960s, mindsets started to shift as the population became more environmentally conscious. In the next decade, citizens demanded that governance over our natural resources be enacted.
How did they do this?
After decades of river fires that would burn bridges, boats, and buildings along the shore, citizens demanded change.
The Cleveland mayor (acting as a voice of leadership) testified before the US Congress. This led to the formation of the Environmental Protection Agency (EPA), which in part led to the passage of the Clean Water Act.
From a governance perspective, governing bodies were created with the authority to tackle the problem. The Clean Water Act was a statute that called for policies and standards. The clean-up was funded through local bonds and federal monies. If you think about the people, the agencies, and the controls put in place, this is what governance looks like. And these are the concepts we apply to data today to ensure that our data lakes, rivers, and streams stay clean and usable for everyone.
*** Data Governance is not a project or a program; it’s a core business function that is necessary in order to compete in the 21st-century business climate.***
Just as water needs to go through a comprehensive set of water quality checks before being consumed, data needs to go through data quality checks before being used.
There are six key dimensions in which Data Quality should be assessed. The first is completeness: is all the data available? What about consistency: can we match data across sources or datasets? We need to look at uniqueness: is there a single definition of that data? What about validity: does the data match the rules? You can't have someone in the system whose age is 200; we know that's not possible in the real world, so why should it be allowed in your systems? Then there's accuracy: is the data correct? And lastly, timeliness: is the data available when needed? All six coupled together make up your Data Quality.
And many different issues contribute to the quality of your data. Some key contributors are source system issues, data input errors, redundant data, inconsistent usage, and a lack of data auditing, all of which can be improved upon with policies and processes set forth in Data Governance. So you can see how it is all interconnected.
Furthermore, today you will often see Data Quality lumped in with Master Data Management, because many MDM technology offerings provide data profiling and data quality tools as part of their product; but Data Quality is indeed considered a separate discipline. So what specifically is Master Data Management, and where does it differ from data quality?
It is a technology-driven discipline that allows companies to accurately combine data from multiple data sources; it is used to create the master definition for data domains and to drive consistent use of high-integrity data across the company.
Imagine all of the different data sources used at your company to bring data in. You can have data from an ERP system, from a CRM system, and maybe even a claims system, all representing a single customer in three different places. With data being captured in different ways, there are inevitably going to be some differences: maybe the person has recently moved, so their address differs across systems, or maybe they go by a nickname they put in one system but not the others. MDM is the process of applying rules to determine the golden record and ensure alignment around common use of data.
And to bring it full circle, Data Quality and MDM can only be successful when operating under a well-implemented Data Governance program.
So why is Data Governance so important in the Age of AI?
First, it saves time down the road. When building a predictive model, data scientists spend most of their time cleaning, identifying, and profiling the data they will use. Imagine having clean data, all accessible in one place, cataloged nicely and ready for you to use. The time savings would be tremendous.
Second, we've all heard this before: put garbage into your model and you will get garbage out. The worse the quality of the data you train with, the worse the results of your AI. AI projects shouldn't even be started until you know you have good data, as good data in leads to great decisions out.
And lastly, a big topic in the AI community right now is creating trust in our models and practicing ethical AI. With Data Governance in place, the privacy of the data used in these models, along with the fairness of the model, can be assured, as data governance aids in transparency around the collection, use, and storage of the data, as well as minimizing the bias in the data circulated across the organization.
So overall, bad data means bad everything. It affects the bottom line and your ability to make accurate decisions. 88% of companies report that inaccurate data has had a direct impact on their bottom line, with 12% reporting lost revenue because of inaccurate data, and 42% of managers admitting they have made wrong decisions using bad data. Think about the 1-10-100 rule of clean data: a $1 prevention cost at the point of capture turns into a $10 correction cost downstream if not caught, and balloons into a $100 failure cost at the time of the decision. So although it's a cheap cost upstream, downstream it compounds. The moral of the story is: put in the work upfront to make sure your data is clean and accessible for everyone in the organization.
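The 1-10-100 arithmetic works out like this; the 1,000-record count is a made-up example, while the per-stage dollar figures come from the rule of thumb itself:

```python
# 1-10-100 rule sketch: the cost of the same batch of bad records caught
# at capture, corrected downstream, or left to cause a bad decision.
bad_records = 1_000  # hypothetical batch size
cost_per_record = {"prevention": 1, "correction": 10, "failure": 100}

for stage, unit_cost in cost_per_record.items():
    print(f"{stage}: ${bad_records * unit_cost:,}")
```

The same batch of errors costs $1,000 to prevent, $10,000 to correct, and $100,000 if it reaches a decision, which is why the cheap upstream fix wins.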
Now we are going to look at how a lack of business readiness contributes to data science projects failing.
Whenever we think about a transformation, we think in terms of the people, process, and technology within it. In this transformation towards AI, though, we are seeing that companies rarely lack data, tools, or anything else that fits in the technology bucket. There is a plethora of data out there and many open-source tools available to start analyzing it. What most organizations lack sits in the people and process domains. Correctly structuring a data science team within your organization is a huge step an analytics leader needs to take to enable a data-driven culture and help the company realize the full potential of its analytical capabilities.
What's even more interesting is that setting up an organizational strategy for data science not only secures a place for data science to grow and flourish inside the organization, it also helps the teams around the data science team learn how to interact with data scientists, which they currently don't know how to do. Data scientists have very desirable skill sets: they know how to program, how to visualize and analyze data, and how to build predictive and statistical models. Because of their knowledge across multiple domains, they often get pinged and pulled to put out fires, and data science initiatives get thrown on the back burner instead of working through deliberate projects scoped out by the business teams.
Let's take a look at the three main types of data science strategies organizations are using to set up their data science teams for success.
The first is a decentralized strategy: think Finance vs. Sales vs. Product vs. Customer Success, each with its own analytics team dedicated to and embedded within the function. Some cons are that you will have to move and transform data between applications, potentially duplicate work, and operate in a more reactive manner, tackling problems only as they appear. The benefits are that it's easy to build subject matter expertise within each area, and the analytics functions are closely aligned with the business, its issues, and its customers. This setup commonly arises in larger organizations where data science initiatives have grown organically in multiple parts of the business.
Now let's jump to the other end of the spectrum and look at a centralized strategy: all quantitative analysts, data engineers, and data scientists report into a central analytics hierarchy, with responsibilities spanning the organization. This is very common, and you may have seen it branded as a COE, or center of excellence. Time and resources are managed within that unit to develop technical expertise and modeling capabilities, as opposed to minimizing the response time between business question and answer. It's a very proactive approach. The benefits are shared services, processes, tools, and methodologies, and being better positioned for long-term innovation. Centralized functions can work well in analytically mature organizations with the time, patience, and money to fund what is essentially an internal research capability. The cons are that it requires a large commitment and investment to empower fast and effective organizational adoption, and building subject matter expertise takes much more effort.
Lastly, we will look at what falls between these two ends of the spectrum: a semi-centralized strategy. Like a centralized structure, a single organizational data science leadership team sets the data science strategy, and its management team serves as functional managers who hire, develop, and promote data scientists. Sister (or embedded) teams of engineers enable production deployment. However, the data scientists are assigned to (and might even sit with) various business units and focus on domain-specific problems. Breadth of knowledge can be gained by rotating data scientists among the various centralized sub-teams. In short, the organization gets a centralized infrastructure, a common data science strategy, and effective talent management, and the business units get somewhat dedicated teams who are knowledgeable about their specific needs.
Every organization is at a different point in the journey, so there is no right or wrong answer when setting up your data science organizational strategy. The key is to pick a strategy and educate the organization on how to adapt to it.
The other aspect folded into a lack of business readiness is making sure you are defining achievable use cases for your data science teams. This happens in three simple steps. First, list out all the potential use cases. This is the easiest part; the only guideline is that each has to be a question that can be answered using data, and it doesn't necessarily need a straightforward answer either: you may be looking for an explanation or validation. I want to caution you, when thinking of these use cases, to steer away from things only YOU know about or things YOU may think are impossible. Treat it like a brainstorming session: throw everything out there and see what sticks, and make sure team members from diverse backgrounds are in these discussions, instead of just people from one business unit or area of expertise. Next is evaluating the use cases. I'll show you a blown-up example of this on the next slide, but think of creating a graph with an x and a y axis: on the x axis you have business value, and on the y axis you have the level of effort or technical feasibility. Take the use cases and plot them on the graph to see where they fall. Visualizing them this way makes it really easy to carry out our last step, prioritizing the use cases. Now you can see the ones you should tackle first: those occupying the high-technical-feasibility, high-business-value space.
So those in the top right corner are the use cases we drove out first to give us a quick win. To enable data science across the organization, it is better to start with something small that drives business value than to aim really high and fail; otherwise you give the perception across the organization that data science projects are risky, take a long time to complete, and aren't even successful. By aiming for the more attainable use cases, you are showing success to get the ball rolling, all the while still developing your talent and investing in technologies so that down the line you will be ready to tackle the bigger ones highlighted in red. It's also really important to note that this isn't static either: as you invest in new technologies and your talent grows, you can always add more use cases and then re-evaluate and re-prioritize according to your current business climate. It is an ever-changing cycle that must be iterated upon.
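The scoring-and-prioritization steps above can be sketched in a few lines of code. This is a minimal illustration, assuming hypothetical use cases and made-up 1-to-5 scores for business value and technical feasibility; any real evaluation would come from your own stakeholders:

```python
# Hypothetical use cases scored 1-5 on business value and technical feasibility.
use_cases = [
    {"name": "Churn prediction", "value": 5, "feasibility": 4},
    {"name": "Demand forecasting", "value": 4, "feasibility": 5},
    {"name": "Real-time fraud detection", "value": 5, "feasibility": 2},
    {"name": "Report automation", "value": 2, "feasibility": 5},
]

# Quick wins occupy the high-value, high-feasibility quadrant (top right).
quick_wins = [u for u in use_cases if u["value"] >= 4 and u["feasibility"] >= 4]

# Prioritize by combined score so the easiest high-impact work comes first.
ranked = sorted(use_cases, key=lambda u: u["value"] + u["feasibility"], reverse=True)

print([u["name"] for u in quick_wins])
```

As talent and technology mature, you would re-score and re-run the ranking, which mirrors the iterative cycle described above.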
The number one factor making operationalization unreachable is the failure to recognize that building machine learning and scaling machine learning are two different sets of problems, each with its own set of solutions. This plays a key role in why data science projects fail. A lot of companies are just aiming to build models, which is a great place to start, but if you want your data science projects to be successful for the long term and integrated into the business, you need to make sure that once the models are built, they can be scaled.
Think about when you are building a model: you are normally running it on your computer in a Jupyter notebook. What happens when these models need to go into production and run in real time? What you were building on your local machine will almost surely break when scaled into production. Models in production should run automatically, on a platform with serious processing power, and they should be checked regularly through an automated process for model drift or staleness. These are considerations you didn't even need to think about when building the model on your machine, because there you weren't deploying anything; models were run on command and only validated against other models manually.
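One common way to automate the drift check mentioned above is the Population Stability Index (PSI), which compares the distribution a model was trained on against what it sees in production. The sketch below is illustrative only, using synthetic data and the commonly cited (but not universal) rule of thumb that PSI above 0.2 signals meaningful drift:

```python
import numpy as np

def psi(expected, actual, bins=10):
    """Population Stability Index between a baseline and a production distribution."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # Floor tiny bins to avoid division by zero and log(0).
    e_pct = np.clip(e_pct, 1e-6, None)
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

rng = np.random.default_rng(0)
train_scores = rng.normal(0.0, 1.0, 5000)  # distribution at training time
prod_scores = rng.normal(1.0, 1.0, 5000)   # shifted production distribution

# Rule of thumb: PSI > 0.2 suggests significant drift worth investigating.
drifted = psi(train_scores, prod_scores) > 0.2
print(drifted)
```

In production, a check like this would run on a schedule against live scoring data and trigger an alert or a retraining job, rather than being eyeballed manually.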
Creating and carrying out a plan to transition the models you built into production is vital if you want the project to succeed.
But this isn't the only arena operationalization is composed of. There is a process side and a technology side. The process side deals with model deployment and model management, but in order to drive that out you need to invest in the proper technology. Before we dive into which specific technologies you should be investing in, let's take a step back and first understand, at a high level, the big buckets we need to think about in order to achieve operationalization from a technological standpoint.
Storage is the first bucket. Everybody knows the volume of data is growing at a compounding rate; worldwide data is expected to hit 175 zettabytes by 2025! So we need somewhere to store all this structured and unstructured data, preferably in a space that has room to grow.
Second, as obvious as it sounds, we need robust data. As we learned earlier, data cannot be siloed, inaccurate, or slow, so we need the proper processes in place to bring data in from multiple systems across the business, even dating back to prior years, and to make sure that data is readily available and easy to access.
Next is compute. Training models on millions of rows of data is no easy task for your computer, and when these models are running in production you need them to be fast, giving real-time results. Processing power is therefore very important and should be a key factor when evaluating the technologies you will be adopting.
Lastly, the output of your analysis should be taken into consideration. Think about how you want to communicate your findings. Are you going to display a bunch of code to your project stakeholders to convince them your model should be used to make decisions? Not likely. So investing in a tool that can help you visualize your findings is just as important as the other three buckets.
Now, using the four big buckets we just outlined, let's walk through the types of technologies that fit into each category. For storage, you are going to want to invest in a data warehouse that is highly scalable and in the cloud. You get on-demand pricing that is affordable for everyone, minimal setup time, and you don't have to worry about managing the database infrastructure. Examples of these warehouses are tools like Amazon Redshift, Google BigQuery, Snowflake, and Azure Synapse.
For achieving robust data, you are going to want to ensure you have the proper data pipelines in place to bring your data to users across the organization in a timely manner. This powers your warehouses and is especially important when giving real-time predictions and analysis on streaming data. Examples of tools in this space are Apache Kafka, Airflow, Confluent, Spark, Python, and REST APIs.
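Whatever orchestrator you pick, the heart of a pipeline is a transform step that validates and standardizes incoming records. Below is a tool-agnostic sketch in plain Python, with hypothetical event fields invented for illustration; a production pipeline would run steps like this inside Airflow, Spark, or a Kafka consumer:

```python
from datetime import datetime

# Hypothetical raw events as they might arrive from a source system.
raw_events = [
    {"id": "1", "amount": "19.99", "ts": "2024-01-05T10:00:00"},
    {"id": "2", "amount": "bad-data", "ts": "2024-01-05T10:01:00"},
    {"id": "3", "amount": "5.00", "ts": "2024-01-05T10:02:00"},
]

def transform(event):
    """Cast types and standardize fields; return None for unparseable rows."""
    try:
        return {
            "id": int(event["id"]),
            "amount": float(event["amount"]),
            "ts": datetime.fromisoformat(event["ts"]),
        }
    except (ValueError, KeyError):
        return None  # a real pipeline would route this to a dead-letter queue

# Keep only the rows that parsed cleanly.
clean = [t for e in raw_events if (t := transform(e)) is not None]
print(len(clean))
```

The point of the sketch is the contract: bad rows are caught at the boundary and quarantined, so downstream models and dashboards only ever see typed, validated data.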
Once the data is available to us for modeling, we need analytical tools or platforms to help us process it and train or build our models. These tools can even be looked at as a framework for the entire life cycle of a data science project. These would be tools like Python, R, Spark, Anaconda, Databricks, Alteryx, or Domino.
Lastly is how we want to communicate our findings, and visualization tools are the main players in this arena, aiding business stakeholders in making decisions. You should be looking at Tableau, Power BI, Plotly, D3, and Matplotlib.
One last point: these are the core tools, but definitely not an all-inclusive list. There are other types of data you may be bringing in, such as video files, text files, and geodatabase files, and other storage options such as NoSQL stores and graph databases.
So we've touched on the process and the technology aspects of operationalization, but what about the people? I want to call out how important it is to make sure your data scientists are working with data engineering resources to achieve success. As AI continues to evolve, so do the roles that come with implementing data science initiatives. Data engineering is used to collect the relevant data and build pipelines to move and transform the data, making it available to the data science team. In smaller organizations this role can sometimes be filled by data scientists; in larger organizations you may see a dedicated data engineering resource with a software engineering background, or the role may be fulfilled by the IT department.
The distinction here is that data scientists may still have to transform the data to fit their models, but they are mainly analyzing the data using statistical methods to draw insights, leaving the data engineering to other resources who are experienced in that arena.
But although they are distinct roles, data engineering resources must work closely with data scientists to streamline capabilities. Asking a data scientist to build a data pipeline is at the far edge of their skills, while it is the bread and butter of a data engineer. Data engineers use their programming and systems-building skills to create big data pipelines, while data scientists use their more limited programming skills and apply their advanced math skills to create data products on top of those existing pipelines. This difference between creating and using lies at the core of a team's failure with big data. A team that expects its data scientists to create the data pipelines will be gravely disappointed.
A special peek into our upcoming webinar, Small Investments, Big Returns: Three Successful Data Science Use Cases, which will be September 17, so be on the lookout. It will go over multiple client use cases where we have come in and helped them at a specific part of their journey or throughout its entirety. No two journeys are alike, and neither are the timelines of climbing toward full AI adoption. The projects range from the manufacturing industry to the oil and gas industry, and even to the education industry. You won't want to miss it.
I very much appreciate your time today, and I look forward to connecting with you all again in the future. If you have any questions, please feel free to ask them now and Kelly will help facilitate them.