Why Data Science Projects Fail
Gabriella Lio
Data Scientist
Glio@sensecorp.com
BBA – McCombs School of Business, Management Information Systems
MS – McCombs School of Business, Business Analytics
AGENDA
1. STATE OF DATA SCIENCE OVERVIEW
2. WHY DATA SCIENCE PROJECTS FAIL
3. PROJECT DO'S AND DON'TS
Data Science Adoption Across Roles
Data science literacy is growing across business disciplines and is becoming critical for nearly all enterprise job titles.
WHY DATA SCIENCE PROJECTS FAIL
87% of data science projects never make it into production. Three factors drive that failure rate:

DATA IS INACCURATE, SILOED, AND SLOW
Successful data science initiatives rely on aligning data quality, master data management, and data governance throughout your organization to ensure they are fully integrated and working together.

LACK OF BUSINESS READINESS
Establish a clear and honest understanding of the requirements and capabilities needed to take on data science initiatives. While investing in technology and people, also conduct thorough due diligence to define achievable use cases.

OPERATIONALIZATION IS UNREACHABLE
Set yourself up for success by investing in business modernization. Make sure your technology stack is up to date, data pipelines and processes are scalable, and data scientists and engineers collaborate.
Data is Inaccurate, Siloed, and Slow

CLEAN WATER: A highly defined process with multiple steps is needed to create, monitor, and deliver clean water.
CLEAN DATA: The delivery of clean data generally lacks the required level of rigor and investment in processes, technologies, and resources.
How do we get clean data that is available across the organization?
• A process that begins with Data Governance (DG), incorporates Data Quality (DQ), and finally leverages Master Data Management (MDM)
• Most companies focus on only one or some of these efforts without coupling them together
Data Governance
Data Governance is the exercise of authority and control (planning,
monitoring, and enforcement) over the management of data assets.
Cleveland and the Cuyahoga River
Data Quality Across 6 Key Dimensions

Key Contributors to Data Quality Issues
1. Source System Issues. Sub-optimal system configuration and fields not being used for their intended purposes
2. Data Input Errors. Freeform fields may be left blank or populated with incorrect data; fields may also go unpopulated entirely, or not be populated at the right time
3. Proliferation of Redundant Data. With limited availability of certified data, different teams source their own data, leading to multiple copies
4. Inconsistent Usage. Without a defined set of enterprise-wide metrics, data is often defined and used in varied ways (e.g., different KPIs, different source sets of data)
5. Lack of Data Auditing. Little to no visibility into actual data quality, and no enforcement to improve it
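To make the six dimensions concrete, here is a minimal pandas sketch of automated checks for three of them (completeness, uniqueness, validity). The table, column names, and rules are illustrative assumptions, not part of the original deck.

```python
# Illustrative pandas sketch: spot-checking three of the six data quality
# dimensions (completeness, uniqueness, validity) on a hypothetical
# customer table. Column names and rules are assumptions, not a standard.
import pandas as pd

customers = pd.DataFrame({
    "customer_id": [101, 102, 102, 104],
    "email": ["a@x.com", None, "b@x.com", "c@x.com"],
    "age": [34, 27, 27, 200],  # 200 is clearly invalid
})

# Completeness: share of non-null values per column
completeness = customers.notna().mean()

# Uniqueness: is the business key actually unique?
duplicate_keys = customers["customer_id"].duplicated().sum()

# Validity: values must satisfy a real-world rule (0 < age < 120)
invalid_age = customers[~customers["age"].between(1, 119)]

print(completeness)
print(f"duplicate customer_ids: {duplicate_keys}")
print(invalid_age)
```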
Master Data Management
• While DQ can be considered a separate discipline, many MDM technology providers today include DQ within their MDM technology offering
• DQ and MDM can only be successful when operating under a well-implemented Data Governance program

Master Data Management is a technology-driven discipline that allows companies to accurately combine data from multiple data sources; it is used to create the master definition for data domains and to drive consistent use of high-integrity data across the company.

Example: one customer captured differently across three source systems, with rules applied to determine the golden record and ensure alignment around common use of data:
• ERP system: Gabby Lio, 1709 Tree Drive, Austin TX 78745, 10-31-1990
• CRM system: Gaby Lio, 1907 Steele Ct., Austin TX 78789, 10-31-1990
• Claims system: Gabriella Lio, 1709 Tree Drive, Austin TX 78745, 10-30-1990
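As an illustration of the golden-record idea, here is a hedged Python sketch of one common survivorship pattern: per-field source precedence. The precedence order and field names are assumptions, and real MDM tools apply far richer match-and-merge rules.

```python
# Minimal survivorship sketch: merge one customer's records from three
# source systems into a golden record. The rule here -- prefer the value
# from the system ranked most trustworthy for each field -- is one common
# pattern, not the only one; the precedence order is an assumption.
RECORDS = [
    {"source": "ERP",    "name": "Gabby Lio",     "address": "1709 Tree Drive", "zip": "78745", "dob": "10-31-1990"},
    {"source": "CRM",    "name": "Gaby Lio",      "address": "1907 Steele Ct.", "zip": "78789", "dob": "10-31-1990"},
    {"source": "Claims", "name": "Gabriella Lio", "address": "1709 Tree Drive", "zip": "78745", "dob": "10-30-1990"},
]

# Per-field source precedence (illustrative): say Claims holds legal names,
# ERP is authoritative for addresses, and CRM for dates of birth.
PRECEDENCE = {
    "name": ["Claims", "ERP", "CRM"],
    "address": ["ERP", "Claims", "CRM"],
    "zip": ["ERP", "Claims", "CRM"],
    "dob": ["CRM", "ERP", "Claims"],
}

def golden_record(records):
    by_source = {r["source"]: r for r in records}
    golden = {}
    for field, order in PRECEDENCE.items():
        for source in order:
            value = by_source.get(source, {}).get(field)
            if value:  # take the first non-empty value in precedence order
                golden[field] = value
                break
    return golden

print(golden_record(RECORDS))
```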
Data Governance in the Age of AI

SAVES TIME
• When building a predictive model, data scientists spend most of their time cleaning and identifying data to use
• Profiling the data

GARBAGE IN, GARBAGE OUT
• The worse the quality of the data you train with, the worse the result of the AI
• AI projects shouldn't be started until you know you have good data
• Good data in, great decisions out

ETHICAL AI
• Privacy: AI systems must comply with privacy laws that require transparency about the collection, use, and storage of data
• Fairness: Minimizing bias in our data

The Risk of Bad Data
Lack of Business Readiness
• Organizations often lack the necessary analytic team structure to:
  1. Best enable a data-driven culture
  2. Realize the full potential, and ROI, of analytical capabilities
• Companies rarely lack data, tools, or technologies
• It is more of a people and process issue
• Purposefully choosing an organizational strategy is one of the first and foremost decisions an analytics leader can make

PEOPLE | PROCESS | TECHNOLOGY
Organizational Data Science Strategies

DECENTRALIZED
Benefits
• Subject matter expertise quickly available/accessible
• Analytics functions and teams are closely aligned to the business, its issues, and its customers
Challenges
• Redundancy in physical resources and talent
• Inconsistency in process, results, and tools
• Focus on local issues
• No standardization and not leveraging scale

SEMI-CENTRALIZED
Benefits
• Shared services, processes, tools, and methodologies
• On-demand provisioning and better cost control
• Continuous improvement is likely, as efforts are focused on iteratively improving a core business
Challenges
• Less transparent allocation of resources among different initiatives
• Tends to bias certain business units
• Difficulty in cross-functional alignment and consensus

CENTRALIZED
Benefits
• Shared services, processes, tools, and methodologies
• On-demand provisioning and better cost control
• Best positioned for long-term innovation and value by being removed from the day-to-day fires of business units
Challenges
• Requires CXO-level commitment and investment to empower fast and effective organizational adoption
• Business and subject matter expertise requires more effort, engagement, and evangelism to attain
Defining Achievable Use Cases in 3 Steps

1. List out potential use cases
• A question that can be answered using data
• You may be looking for an answer, an explanation, or just validation
• Steer away from bias towards things only YOU know about and bias towards things people think are too hard or impossible

2. Evaluate each use case
• Level of Effort/Technical Feasibility
• Business Value

3. Prioritize use cases (a scoring sketch follows)
• A low level of effort/high technical feasibility coupled with high business value is a good place to start
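Steps 2 and 3 amount to scoring and sorting. A toy Python sketch shows the idea; the 1-5 scales and the example use cases are assumptions for illustration only.

```python
# Toy sketch of steps 2 and 3: score each candidate use case on business
# value and technical feasibility (1-5 scales here, an assumption), then
# sort so high-value, high-feasibility items float to the top.
use_cases = [
    {"name": "Churn prediction",        "value": 5, "feasibility": 4},
    {"name": "Real-time fraud scoring", "value": 5, "feasibility": 2},
    {"name": "Report automation",       "value": 3, "feasibility": 5},
]

# Rank by the product of the two scores; any monotone combination works.
for uc in sorted(use_cases, key=lambda u: u["value"] * u["feasibility"], reverse=True):
    print(f'{uc["name"]}: value={uc["value"]}, feasibility={uc["feasibility"]}')
```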
Evaluating DS Use Cases: Examples
Building vs. Scaling Machine Learning

COMMON TOOLS USED
• Building: Scikit-Learn, Pandas, Jupyter, local environment
• Scaling: MLflow, MLlib, Spark, IDEs, DVC, cloud environment

MODEL TRAINING AND PREDICTION
• Building: managed by data scientists
• Scaling: automatically orchestrated

DEPLOYMENT
• Building: not deployed
• Scaling: deployed in production

MODEL VALIDATION
• Building: manual
• Scaling: automated
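A minimal sketch of the contrast, assuming scikit-learn and MLflow are installed: the model fit is the "building" workflow, and wrapping it in a tracked MLflow run is one small step toward the "scaling" workflow, since the logged model can be versioned and later deployed by an automated pipeline. The dataset and parameters are synthetic.

```python
# Hedged sketch: a local scikit-learn fit ("building") wrapped in an
# MLflow run so the experiment is tracked and the artifact is deployable
# ("scaling"). Data and hyperparameters are illustrative.
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1_000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

with mlflow.start_run():                      # everything inside is tracked
    model = LogisticRegression(max_iter=500)  # "building": a local fit
    model.fit(X_train, y_train)
    mlflow.log_metric("test_accuracy", model.score(X_test, y_test))
    mlflow.sklearn.log_model(model, "model")  # artifact a pipeline can deploy
```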
What do we need to achieve Operationalization?

Storage
• The volume of data is growing
• We need somewhere to put all this data

Robust Data
• Data from different sources (e.g., CRM, ERP, spreadsheets)
• Across the business (e.g., HR, Finance, Customer)
• Historical
• Readily available

Compute
• High-performing data processing
• Processing power to drive out our analysis

Output
• Communicating findings
• Graphs/charts
• Presentations
Model Deployment
• Testing
• Automated deployment
• Ethics in AI
  o Trusted model
  o Fair model

Model Management
• Statistical process control
• Data drift and model drift (see the drift-check sketch below)
• Stale models
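As one illustration of model management, here is a hedged sketch of a data-drift check using a two-sample Kolmogorov-Smirnov test from SciPy. The synthetic data and the 0.05 threshold are assumptions; production monitoring would run such checks per feature, on a schedule.

```python
# Illustrative model-management check: compare a feature's training
# distribution against what the live pipeline is currently seeing,
# using a two-sample Kolmogorov-Smirnov test.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
training_feature = rng.normal(loc=0.0, scale=1.0, size=5_000)
live_feature = rng.normal(loc=0.4, scale=1.0, size=5_000)  # shifted: drift

stat, p_value = ks_2samp(training_feature, live_feature)
if p_value < 0.05:  # threshold is an assumption, tune per use case
    print(f"Possible data drift (KS={stat:.3f}, p={p_value:.3g}); consider retraining.")
```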
Technology Stack is Up to Date

DATA WAREHOUSES
• Highly scalable, managed cloud data warehouses enable you to store TBs of data with just a few lines of SQL and no infrastructure
• On-demand pricing means the technology is affordable for everyone, with only a few minutes of setup time
• Examples: Amazon Redshift, Google BigQuery, Snowflake, Azure Synapse

DATA PIPELINES
• Ensure you have the fuel to power your warehouse and tools; without data, you have nothing to analyze
• Especially important when giving real-time predictions and analysis on streaming data
• Examples: Apache Kafka, Apache Airflow, Confluent, Spark, Python, REST APIs (a minimal DAG sketch follows this slide)

ANALYTICAL TOOLS
• You need a framework for the entire life cycle of a data science project
• The platform contains all the tools required for executing the lifecycle of the data science project, spanning its different phases
• Examples: Python, R, Apache Spark, Anaconda, Databricks, H2O.ai, Alteryx, Domino

VISUALIZATIONS
• In the world of Big Data, data visualization tools and technologies are essential to analyze massive amounts of information and make data-driven decisions
• Examples: Matplotlib, Tableau, Power BI, Plotly, D3, QlikView
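To ground the data pipelines column, here is a minimal Apache Airflow sketch of a daily extract-and-load DAG, assuming Airflow 2.x. The task names and bodies are placeholders, not a reference pipeline.

```python
# Minimal Airflow sketch of the "data pipelines" layer: a daily DAG that
# extracts from a source and loads into a warehouse. extract_orders and
# load_orders are hypothetical placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract_orders():
    print("pull yesterday's orders from the CRM API")  # placeholder

def load_orders():
    print("copy the extracted file into the warehouse")  # placeholder

with DAG(
    dag_id="orders_daily",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",  # Airflow 2.4+ argument; older versions use schedule_interval
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract", python_callable=extract_orders)
    load = PythonOperator(task_id="load", python_callable=load_orders)
    extract >> load  # run extract, then load
```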
Collaboration between Data Scientists & Data Engineers
• Data engineers collect relevant data, then move and transform it through "pipelines" for the data science team
• Data scientists analyze, test, aggregate, and optimize the data, and present findings for the company
• Some companies with advanced processes round out their teams with AI engineers, machine learning engineers, or deep learning engineers

With this division of labor, each task goes to the specific data professionals best suited to it.
Collaboration between Data Scientists & Data Engineers
(Skill spectrum: data engineering skills at one end, analytical skills at the other; data engineers sit toward the former and data scientists toward the latter, overlapping in the middle.)
• A data engineering resource can do some basic to intermediate-level analytics but will be hard pressed to do the advanced analytics that a data scientist does.
• Having a data scientist create a data pipeline is at the far edge of their skills, but it is the bread and butter of a data engineering resource.
• The two roles are complementary, with data engineering resources supporting the work of data scientists; both overlap on engineering and analysis.
3 Key Takeaways: What do you do when you notice…

Data is inaccurate, siloed, and slow?
Implement Data Governance, which will enable Data Quality and Master Data Management.

There is a lack of business readiness?
Create an organizational strategy for data science that works for your company, and prioritize use cases iteratively.

Operationalization is unreachable?
Realize the difference between building and scaling machine learning models, update your technology stack, and make sure data scientists collaborate with data engineering resources.
Survey the Audience: Discovering Project Do's and Don'ts
When designing a solution, is your team more focused on…
Designing the 'supreme' solution, or beginning the solution early, being agile, and starting small?
What is the average timeline for deliverables on data science projects you have been a part of?
Timelines that deliver on weekly scales, or timelines that deliver on monthly scales?
When engaging in a project, is your team...
Hyper-focused on the business problem, or hyper-focused on the solution?
PROJECT DO'S AND DON'TS

Do's
• Begin early, be agile, and start small
• Timelines that deliver on weekly scales
• Aim for "good enough" & adding business value
• 4-6 person teams
• Hyper-focused on the business problem
• Co-developing with SMEs and stakeholders
• Focus on fast mover strategy

Don'ts
• Designing the 'supreme' solution
• Timelines that deliver on monthly scales
• Aim for perfect accuracy
• Large, slow-moving teams
• Hyper-focused on the solution
• Developing in silos
• Focus on first mover strategy
THE JOURNEY TO AI ADOPTION
(Maturity curve: technical capability plotted against business readiness.)

1. Experimentation: Business leaders are exploring the landscape, talking to vendors, etc.
2. Clean Data: Data is reliable and accurate for deep analysis and modeling.
3. Established Data Governance: Accountable and consistent standards are implemented.
4. Proof of Value: Real and measurable prototypes are scoped and built for technical understanding and business value.
5. Modern Data Architecture: Data is no longer slow or siloed, thanks to next-gen technology stacks and business stakeholder buy-in.
6. Scalable Machine Learning: Teams, technologies, and techniques are highly efficient at building, deploying, and managing data pipelines across the enterprise.
7. AI Adoption: AI has been seamlessly integrated into enterprise processes and technologies.
THANK YOU!
Any Questions?

Editor's Notes
  1. Good afternoon. I want to start by thanking everyone for joining us today. My name is Gaby Lio, and I am a data scientist at Sense Corp. We have worked with multiple Fortune 500 companies, sharing and implementing data-driven solutions, and I have plenty of scar tissue around why data science projects succeed and why they fail, so I'm excited to be speaking with you all today. Let's dive right in.
  2. Before we dive into why data science projects are failing, I want to start by looking at the current state of data science and how rapidly AI is being adopted across industries and roles, to paint a better picture of why these projects need to succeed. The Anaconda State of Data Science survey of the Anaconda community painted an interesting picture of the types of jobs held by data science learners, and the results showed adoption across every role. You can see a revolution is happening, with interest in data science spanning a very broad range of job functions. This signals that these professionals are increasing their data literacy and will be able to adapt to a data-driven business model where machine learning is incorporated into their day-to-day functions. They are ready for it, so why isn't this adoption spreading faster and being implemented across every organization today?
  3. The answer is that data science projects are failing at an alarming rate. Depending on who you ask, most industry surveys will cite that nearly 9 out of 10 data science projects fail, and we can attribute this failure to three specific reasons. The first factor revolves around your data. Having your data in silos prevents employees across the organization from accessing a shared source of data, while inaccurate data can lead to inaccurate decision making and eventually a loss in revenue. Furthermore, if the speed at which your data is ingested and made available to you is slow, real-time analytics will never be an option. Successful data science initiatives therefore rely on aligning data quality, master data management, and data governance to ensure all three are integrated and fully working together to prevent inaccurate, siloed, and slow data. The second factor is a lack of business readiness. There is often no honest understanding of the requirements and capabilities needed to take on data science initiatives. We will tackle the people and process side of business readiness by touching on how to set up your data science team within your organization and how other teams should interact with data scientists. Then we'll take a deep dive into defining achievable use cases that can be easy wins for you and your team. The last factor contributing to why data science projects fail is operationalization being unreachable. To set your team up for success, your company should be investing in business modernization, specifically making sure the technology stack is up to date and that data pipelines and processes are scalable. There should also be a clear distinction between roles on the team, with data scientists and data engineers working together to create and push models into production. I will step through each of these in greater detail, giving you solutions to prevent these common pitfalls.
  4. Let's start by addressing the issue of data.
  5. To better understand why clean data is so important, I am going to relate clean water to clean data throughout this section. In our developed world, we take clean water for granted. We simply turn on the tap, pour a glass, and drink the water. But it hasn't always been that way, and it wasn't a simple process that got us there. We developed technologies such as aqueducts, filters, and water treatment facilities to create and deliver clean water, and now it's a standard. So why haven't we created the standard that our data should be clean? We continue to struggle with clean data because many companies lack the required level of rigor and investment in processes, technologies, and resources to deliver it. We know that dirty water can harm people's health, yet we don't easily accept or recognize the impact that dirty data can have on an organization.
  6. So how do we get clean data that is available to all who need it across the organization? It's a process that begins with Data Governance, incorporates Data Quality, and finally leverages Master Data Management. Most companies focus on only one or some of these efforts without coupling them together. While water can freely roll downhill, data needs to be transported downstream, and it requires a defined and concentrated effort to end up with clean data. Ensuring these three disciplines are aligned organizationally, fully integrated, and working together is the key to success.
  7. Let's start with Data Governance. At Sense Corp we define Data Governance as "the exercise of authority and control (planning, monitoring, and enforcement) over the management of data assets." The framework you see on the screen represents the various categories that must be considered to make any governance effort successful. (Read out all of them.) But probably the best way to understand governance is through a real-life example of something that happened back in 1969 in Cleveland, Ohio.
  8. For decades in the first half of the 20th century, industrial waste and sewage regularly poured into the Cuyahoga (KAI-A-HOGA) River, and residents accepted it as a consequence of the city's prosperity. But in the 1960s, mindsets started to shift as the population became more environmentally conscious. In the next decade, citizens demanded that governance over our natural resources be enacted. How did they do this? After decades of river fires that burned bridges, boats, and buildings along the shore, citizens demanded change. The Cleveland mayor (acting as a voice of leadership) testified before the US Congress. This led to the formation of the Environmental Protection Agency (EPA), which in part led to the passage of the Clean Water Act. From a governance perspective, governing bodies were created with authority to tackle the problem. The Clean Water Act was a statute that called for policies and standards. The clean-up was funded through local bonds and federal monies. If you think about the people, the agencies, and the controls put in place, this is what governance looks like. And these are the concepts we apply to data today to ensure that our data lakes, rivers, and streams stay clean and usable for everyone. *** Data Governance is not a project or a program; it's a core business function that is necessary in order to compete in the 21st-century business climate. ***
  9. Just as water needs to go through a comprehensive set of water quality checks before being consumed, data needs to go through data quality checks before being used. There are six key dimensions along which data quality should be assessed. The first is completeness: is all the data available? What about consistency: can we match data across sources or datasets? We need to look at uniqueness: is there a single definition of that data? What about validity: does the data match the rules? You can't have someone in the system whose age is 200; we know that's not possible in the real world, so why should it be allowed in your systems? Then there's accuracy: is the data correct? And lastly, timeliness: is the data available when needed? All six coupled together make up your data quality. Many different issues contribute to the quality of your data. Some key contributors are source system issues, data input errors, redundant data, inconsistent usage, and lack of data auditing, all of which can be improved upon with policies and processes set forth in Data Governance. So you can see how it is all interconnected.
  10. Furthermore, today you will often see Data Quality lumped in with Master Data Management, because many MDM technology offerings provide data profiling and data quality tools inside their products, but Data Quality is indeed considered a separate discipline. So what specifically is Master Data Management, and where does it differ from data quality? It is a technology-driven discipline that allows companies to accurately combine data from multiple data sources; it is used to create the master definition for data domains and to drive consistent use of high-integrity data across the company. Imagine all of the different data sources used at your company to bring data in. You can have data from an ERP system, from a CRM system, and maybe even a claims system, all representing a single customer in three different ways. With data being captured in different ways, there are inevitably going to be differences: maybe the person has recently moved, so their address differs across systems, or maybe they have a nickname they put in one system but not the other. MDM is the process of applying rules to determine the golden record to ensure alignment around common use of data. And to bring it full circle, Data Quality and MDM can only be successful when operating under a well-implemented Data Governance program.
  11. So why is Data Governance so important in the age of AI? Firstly, it saves time down the road: when building a predictive model, data scientists spend most of their time cleaning and identifying data to use, or profiling their data. Imagine having clean data, all accessible in one place, cataloged nicely and ready for you to use. The time savings would be tremendous. Secondly, we've all heard this before, but put garbage into your model and you will get garbage out. The worse the quality of the data you train with, the worse the results of your AI. AI projects shouldn't even be started until you know you have good data, as good data in leads to great decisions out. And lastly, a big topic in the AI community right now is creating trust in our models and practicing ethical AI. With Data Governance in place, the privacy of the data used in these models, along with the fairness of the model, can be assured, since data governance aids transparency around the collection, use, and storage of the data, as well as minimizing bias in the data circulated across the organization.
  12. So overall, bad data = bad everything. It affects the bottom line and your ability to make accurate decisions. 88% of companies report that inaccurate data has had a direct impact on their bottom line, with the average company reporting lost revenue of 12% because of inaccurate data; and 42% of managers acknowledge that they have made wrong decisions using bad data. Think about the 1-10-100 rule of clean data: a $1 prevention cost at the point of capture turns into a $10 correction cost downstream if not caught, and balloons into a $100 failure cost at the time of the decision. So although it's a cheap cost upstream, downstream it compounds! The moral of the story: put in the work upfront to make sure your data is clean and accessible for everyone in the organization.
  13. Now we are going to look at how a lack of business readiness contributes to data science projects failing.
  14. Whenever we think about a transformation, we think in terms of the people, process, and technology within that transformation. In this transformation toward AI, though, we are seeing that companies rarely lack data, tools, or anything else that fits in the technology bucket. There is a plethora of data out there and many open-source tools available to start analyzing it. What most organizations lack falls in the people and process domains. Correctly structuring a data science team within the organization is a huge step that an analytics leader needs to take to enable a data-driven culture and help the company realize the full potential of its analytical capabilities. What's even more interesting is that setting up an organizational strategy for data science not only secures a place for data science to grow and flourish inside the organization, it also helps the teams around the data science team learn how to interact with data scientists, which they currently don't know how to do. Data scientists have very desirable skill sets. They know how to program, how to visualize and analyze data, and how to build predictive and statistical models. Because of their knowledge across multiple domains, they often get pinged and pulled to put out fires, resulting in data science initiatives getting thrown on the back burner instead of proceeding as deliberate projects scoped out by the business teams.
  15. Let's take a look at the three main types of data science strategies organizations are using to set their data science teams up for success. The first is a decentralized strategy: think Finance vs. Sales vs. Product vs. Customer Success, each with its own analytics team dedicated to and embedded within the function. Some cons are that you will have to move and transform data between applications, potentially do duplicated work, and operate in a reactive manner: when they see a problem, then they tackle it. The benefits, though, are that it is easy to build subject matter expertise within each area, and the analytics functions are closely aligned with the business, its issues, and its customers. This organization commonly arises in larger companies where data science initiatives have grown organically in multiple parts of the business. Now let's jump to the other end of the spectrum and look at a centralized strategy: all quantitative analysts, data engineers, and data scientists report into a central analytics hierarchy, with responsibilities spanning the organization. This is very common and is what you may have seen branded as a COE, or center of excellence. Time and resources are managed within that unit to develop technical expertise and modeling capabilities, as opposed to minimizing response time between business question and answer. It's a very proactive approach. The benefits are shared services, processes, tools, and methodologies, and being better positioned for long-term innovation. Centralized functions can work well in analytically mature organizations with the time, patience, and money to fund what is essentially an internal research capability. The cons are that it requires a large commitment and investment to empower fast and effective organizational adoption, and building subject matter expertise takes a lot more effort. Lastly, we will look at what falls between these two extremes, which is a semi-centralized strategy. Like a centralized structure, a single organizational data science leadership team sets the data science strategy. Its management team serves as functional managers who hire, develop, and promote data scientists. Sister (or embedded) teams of engineers enable production deployment. However, the data scientists are assigned to (and might even sit with) various business units and focus on those units' domain-specific problems. Breadth of knowledge can be gained by rotating data scientists among the various centralized sub-teams. In short, the organization gets a centralized infrastructure, a common data science strategy, and effective talent management, and the business units get somewhat dedicated teams who are knowledgeable about their specific needs. Every organization is at a different point in the journey, so there is no right or wrong answer when setting up your data science organizational strategy. The key is to pick a strategy and educate those in the organization on how to adopt it.
16. The other aspect folded into a lack of business readiness is making sure you are defining achievable use cases for your data science teams. This happens in three simple steps. First, list out all the potential use cases. This is the easiest part; there are no guidelines other than that it has to be a question that can be answered using data, and it doesn't necessarily have to have a straightforward answer either: you may be looking for an explanation or a validation. I want to caution you, when thinking of these use cases, to steer away from things only YOU know about or things YOU may think are impossible. Think of it like an idea brainstorming session: throw everything out there and see what sticks. It's important to have team members from diverse backgrounds in these discussions, instead of just people from one business unit or area of expertise. Next is evaluating the use cases. I'll show you a blown-up example of this on the next slide, but think of creating a graph with an x and a y axis. On the x axis you have business value, and on the y axis you have the level of effort or technical feasibility. Look at the use cases and plot them on the graph to see where they fall. Visualizing them this way makes it really easy to carry out our last step, which is prioritizing the use cases. Now you can see the ones you should tackle first: those occupying the high-technical-feasibility, high-business-value space.
17. So those in the top right corner are the use cases we tackled first to get a quick win (a small sketch of this kind of grid follows this note). To enable data science across the organization, it's better to start with something small that drives business value than to aim really high and fail; otherwise you give the organization the perception that data science projects are risky, take a long time to complete, and aren't even successful. By aiming for the more attainable use cases, you show success and get the ball rolling, all the while still developing your talent and investing in technologies so that down the line you will be ready to tackle the bigger ones highlighted in red. It's also really important to note that this isn't static either: as you invest in new technologies and your talent grows, you can always add more use cases and then re-evaluate and re-prioritize according to your current business climate. It's an ever-changing cycle that must be iterated upon.
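To make the evaluation step concrete, here is a minimal sketch of the business-value vs. technical-feasibility grid in Python with Matplotlib (one of the visualization tools covered later). The use case names and their scores are made up for illustration; in practice, the scores would come out of your own brainstorming and scoring sessions.

```python
# A minimal sketch of the business-value vs. technical-feasibility grid.
# All use case names and scores below are made up for illustration.
import matplotlib.pyplot as plt

# (business value, technical feasibility), both scored 0-10 by the team
use_cases = {
    "Churn alerting":        (8, 8),
    "Invoice deduplication": (4, 9),
    "Demand forecasting":    (9, 4),
    "Dynamic pricing":       (9, 2),
}

fig, ax = plt.subplots(figsize=(6, 6))
for name, (value, feasibility) in use_cases.items():
    ax.scatter(value, feasibility)
    ax.annotate(name, (value, feasibility),
                textcoords="offset points", xytext=(5, 5))

# Quadrant boundaries: top right = quick wins, tackle these first
ax.axvline(5, linestyle="--", color="gray")
ax.axhline(5, linestyle="--", color="gray")
ax.set_xlabel("Business value")
ax.set_ylabel("Technical feasibility")
ax.set_xlim(0, 10)
ax.set_ylim(0, 10)
ax.set_title("Use case prioritization")
plt.show()
```

Anything landing in the top right quadrant is a candidate quick win; anything in the bottom right is a longer-term bet that needs more investment in talent and technology before you take it on.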
18. The number one factor making operationalization unreachable is failing to recognize that these two concepts, building machine learning vs. scaling machine learning, are two different sets of problems that each have their own set of solutions. This plays a key role in why data science projects fail. A lot of companies are just aiming to build models, which is a great place to start, but if you want your data science projects to succeed long term and be integrated into the business, you need to make sure that once they are built, they can be scaled. Think about when you are building a model: you are normally running it on your computer in a Jupyter notebook. What happens when these models need to go into production and run in real time? What you built on your local machine will almost surely break when scaled into production. Models in production should run automatically, on a platform with serious processing power. They should be checked regularly, through an automated process, for model drift or for whether the model has gone stale. These are all considerations you didn't even need to think about when building the model on your machine, because there you weren't deploying anything; models were run on command and validated against other models manually. Creating and carrying out a plan to transition the models you built into production is vital if you want the project to succeed.
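To illustrate what one of those automated production checks might look like, here is a minimal sketch of a drift test, assuming you keep a baseline sample of model scores from training time and a recent sample from production. The threshold, sample sizes, and synthetic data are all illustrative, and a two-sample Kolmogorov-Smirnov test is just one common way to compare the two distributions.

```python
# A minimal sketch of an automated drift check, assuming you keep a
# baseline sample of model scores from training time and a recent
# sample from production. Threshold and sample sizes are illustrative.
import numpy as np
from scipy.stats import ks_2samp

DRIFT_P_VALUE = 0.01  # illustrative cutoff; tune per use case

def scores_have_drifted(baseline: np.ndarray, recent: np.ndarray) -> bool:
    """Flag drift when a two-sample Kolmogorov-Smirnov test says the two
    score distributions are unlikely to come from the same population."""
    statistic, p_value = ks_2samp(baseline, recent)
    return p_value < DRIFT_P_VALUE

if __name__ == "__main__":
    # Synthetic stand-ins for logged scores; real ones would come from
    # wherever your serving platform records predictions.
    rng = np.random.default_rng(42)
    baseline = rng.normal(loc=0.40, scale=0.10, size=5_000)  # training-time scores
    recent = rng.normal(loc=0.55, scale=0.10, size=5_000)    # last week's scores
    if scores_have_drifted(baseline, recent):
        print("Score distribution shifted; flag the model for retraining.")
```

Scheduled to run automatically against each production model, a check like this is what turns "has the model gone stale?" from a manual question into part of the platform.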
19. But this isn't the only arena operationalization is composed of. There is a process side and a technology side. The process side deals with model deployment and model management, but in order to drive that out you need to invest in the proper technology. Before we dive into which specific technologies you should be investing in, let's take a step back and first understand, at a high level, the big buckets we need to think about in order to achieve operationalization from a technological standpoint. Storage is the first bucket. Everybody knows the volume of data is growing at a compounding rate; by 2025, worldwide data is expected to hit 175 zettabytes (a zettabyte is a billion terabytes)! So we need somewhere to store all this structured and unstructured data, preferably somewhere with room to grow. Secondly, as obvious as it sounds, we need robust data. As we learned earlier, data cannot be siloed, inaccurate, or slow, so we need the proper processes in place to bring data in from multiple systems across the business, even data dating back to prior years, and to make sure that data is readily available and easy to access. Next is compute. Training models on millions of rows of data is no easy task for your computer, and when these models are running in production you need them to be fast and give real-time results, so processing power is very important and should be a factor when considering the technologies you will adopt. Lastly, the output of your analysis should be taken into consideration. Think about how you want to communicate your findings. Are you going to display a bunch of code to your project stakeholders to convince them your model should be used to make decisions? Not likely, so investing in a tool that can help you visualize your findings is just as important as the other three buckets.
20. Now, using those four big buckets we just outlined, let's walk through the types of technologies that fit into each category. For Storage, you are going to want to invest in a data warehouse that is highly scalable and in the cloud. You get on-demand pricing that is affordable for everyone, minimal setup time, and you don't have to worry about managing the database infrastructure. Examples of these warehouses are tools like Amazon Redshift, Google BigQuery, Snowflake, or Azure Synapse. For achieving robust data, you are going to want to ensure you have the proper data pipelines in place to bring data to users across the organization in a timely manner. This powers your warehouses and is especially important when giving real-time predictions and analysis on streaming data. Examples of tools you would invest in for this space are Apache Kafka, Airflow, Confluent, Spark, Python, and REST APIs. Once the data is available for modeling, we need Analytical tools or platforms to help us process it and train or build our models; these can even be looked at as a framework for the entire life cycle of a data science project. These would be tools like Python, R, Spark, Anaconda, Databricks, Alteryx, or Domino. Lastly is how we want to communicate our findings, and visualization tools are the main players in this arena that aid business stakeholders in making decisions. You should be looking at Tableau, Power BI, Plotly, D3, and Matplotlib. One last point: these are the core tools, but definitely not an all-inclusive list. There are other types of data you may be bringing in, such as video files, text files, and geodatabase files, which call for things like NoSQL storage and graph databases.
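To show what the robust-data bucket looks like in practice, here is a minimal sketch of a daily pipeline defined in Apache Airflow, one of the orchestration tools named above. The DAG id, task names, dates, and the extract/load bodies are hypothetical placeholders, not a real integration.

```python
# A minimal sketch of a daily pipeline in Apache Airflow. The DAG id,
# task names, and function bodies are hypothetical placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract_orders():
    """Pull the latest batch from the (hypothetical) source system."""
    ...

def load_to_warehouse():
    """Append the extracted batch to the (hypothetical) warehouse table."""
    ...

with DAG(
    dag_id="orders_daily_load",      # hypothetical pipeline name
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract_orders",
                             python_callable=extract_orders)
    load = PythonOperator(task_id="load_to_warehouse",
                          python_callable=load_to_warehouse)
    extract >> load  # load runs only after extract succeeds
```

The same extract-then-load pattern scales as you add sources, with the orchestrator handling scheduling and dependency ordering so the data science team can rely on fresh, ready-to-use data.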
21. So we've touched on the process and technology aspects of operationalization, but what about the people? I want to call out how important it is to make sure your data scientists are working with data engineering resources to achieve success. As AI continues to evolve, so do the roles that come with implementing data science initiatives. Data engineering collects the relevant data and builds the pipelines that move and transform it to make it available to the data science team. In smaller organizations this role is sometimes filled by data scientists; in larger organizations you may see a dedicated data engineering resource with a software engineering background, or the role may be fulfilled by the IT department. The distinction is that data scientists may still have to transform the data to fit their models, but they are mainly analyzing the data using statistical methods to draw insights, leaving the data engineering to resources who are experienced in that arena.
22. But although they are distinct roles, the data engineering resources must work closely with the data scientists to streamline capabilities. Asking a data scientist to build a data pipeline pushes the far edge of their skills, while it's the bread and butter of the data engineering resource. Data engineers use their programming and systems-building skills to create big data pipelines, while data scientists apply their more limited programming skills and their advanced math skills to create sophisticated data products on top of those existing pipelines. This difference between creating and using lies at the core of a team's failure with big data. A team that expects its data scientists to create the data pipelines will be gravely disappointed.
23. (Speaker note: don't elaborate here.) Reference our e-book and our Interop presentation, which has some overlap with this one, and plug the upcoming webinar; encourage people to subscribe. That session will dive deep into a couple of use cases, why they are successful, and how AI applies.
24. A special peek at our upcoming webinar, Small Investments, Big Returns: Three Successful Data Science Use Cases, on September 17, so be on the lookout. It will go over multiple client use cases where we came in and helped at a specific point in their journey, or throughout the entirety of it. No two journeys are alike, and neither are the timelines of climbing towards full AI adoption. The projects range from the manufacturing industry to the oil and gas industry, and even to the education industry. You won't want to miss it. I very much appreciate your time today, and I look forward to connecting with you all again in the future. If you have any questions, please feel free to ask them now and Kelly will help facilitate them.