The document discusses why 87% of data science projects fail to make it into production. It identifies three main reasons for failure: data is inaccurate, siloed and slow; there is a lack of business readiness; and operationalization is unreachable. To address these issues, the document recommends establishing data governance, defining an organizational data science strategy and use cases, ensuring the technology stack is updated, and having data scientists collaborate with data engineers. It also provides tips for successful data science projects, such as having short timelines, small focused teams, and prioritizing business problems over solutions.
3. AGENDA
1. STATE OF DATA SCIENCE OVERVIEW
2. WHY DATA SCIENCE PROJECTS FAIL
3. PROJECT DO'S AND DON'TS
4. Data science literacy is growing across business disciplines and is becoming critical for nearly all enterprise job titles
(Chart: Data Science Adoption Across Roles)
5. WHY DATA SCIENCE PROJECTS FAIL
87% of data science projects never make it into production.

DATA IS INACCURATE, SILOED, AND SLOW
Successful data science initiatives rely on aligning data quality, master data management, and data governance throughout your organization to ensure they are fully integrated and working together.

LACK OF BUSINESS READINESS
Establish a clear and honest understanding of the requirements and capabilities needed to take on data science initiatives. While investing in technology and people, also conduct thorough due diligence to define achievable use cases.

OPERATIONALIZATION IS UNREACHABLE
Set yourself up for success by investing in business modernization. Make sure your technology stack is up to date, data pipelines and processes are scalable, and data scientists and engineers collaborate.
6. WHY DATA SCIENCE PROJECTS FAIL
7. Data is Inaccurate, Siloed, and Slow
CLEAN WATER: A highly defined process with multiple steps is needed to create, monitor, and deliver clean water.
CLEAN DATA: Delivery of clean data generally lacks the required level of rigor and investment in processes, technologies, and resources.
8. How do we get clean data that is available across the organization?
• A process that begins with Data Governance (DG), incorporates Data Quality (DQ), and finally leverages Master Data Management (MDM)
• Most companies focus on only one or some of these efforts without coupling them together
9. Data Governance
Data Governance is the exercise of authority and control (planning, monitoring, and enforcement) over the management of data assets.
11. Data Quality Across 6 Key Dimensions

Key Contributors of Data Quality Issues
1. Source System Issues. Sub-optimal system configuration and fields not being used for their intended purposes.
2. Data Input Errors. Freeform fields may be left blank or populated with incorrect data; fields may not be populated at all, or not at the right time.
3. Proliferation of Redundant Data. With limited availability of certified data, different teams source their own data, leading to multiple copies.
4. Inconsistent Usage. Without a defined set of enterprise-wide metrics, data is often defined and used in varied ways (e.g., different KPIs, different source sets of data).
5. Lack of Data Auditing. Little to no visibility into actual data quality, and no enforcement to improve it.
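These contributors can be caught early with lightweight automated profiling. A minimal sketch in plain Python follows; the records, field names, and rules are hypothetical examples, not from the deck:

```python
# Minimal data-quality profiling sketch: checks completeness (blank
# required field), validity (age rule), and uniqueness (redundant copies).

records = [
    {"id": 1, "name": "Gabby Lio", "age": 31},
    {"id": 2, "name": "",          "age": 200},  # blank field, invalid age
    {"id": 1, "name": "Gabby Lio", "age": 31},   # redundant copy
]

def profile(rows):
    issues = []
    seen = set()
    for i, r in enumerate(rows):
        if not r["name"]:                 # completeness: required field blank
            issues.append((i, "missing name"))
        if not 0 <= r["age"] <= 120:      # validity: no 200-year-old customers
            issues.append((i, "invalid age"))
        key = (r["id"], r["name"], r["age"])
        if key in seen:                   # uniqueness: duplicate record
            issues.append((i, "duplicate record"))
        seen.add(key)
    return issues

print(profile(records))  # each tuple is (row index, issue found)
```

Checks like these are cheap to run on every load, which is exactly the kind of auditing the fifth contributor says is usually missing.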
12. Master Data Management
Master Data Management is a technology-driven discipline that allows companies to accurately combine data from multiple data sources; it is used to create the master definition for data domains and to drive consistent use of high-integrity data across the company.
• While DQ can be considered a separate discipline, many MDM technology providers today include DQ within their MDM technology offering
• DQ and MDM can only be successful when operating under a well-implemented Data Governance program

Example: an ERP system, a CRM system, and a claims system each hold a record for the same customer, and rules are applied to determine the golden record to ensure alignment around common use of data:
Gabby Lio | 1709 Tree Drive | Austin TX 78745 | 10-31-1990
Gaby Lio | 1907 Steele Ct. | Austin TX 78789 | 10-31-1990
Gabriella Lio | 1709 Tree Drive | Austin TX 78745 | 10-30-1990
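The golden-record step can be sketched in a few lines of Python over the three records above. The survivorship rules here (most frequent value per field, with source priority as the tie-break) are illustrative assumptions; real MDM tools let you configure such rules per data domain:

```python
# Golden-record sketch: rule-based survivorship over three source records.
# Priority order (1 beats 3) and the majority-vote rule are assumptions.

sources = [  # (priority, record); lower priority number wins ties
    (1, {"name": "Gabby Lio",     "address": "1709 Tree Drive, Austin TX 78745", "dob": "10-31-1990"}),
    (2, {"name": "Gaby Lio",      "address": "1907 Steele Ct., Austin TX 78789", "dob": "10-31-1990"}),
    (3, {"name": "Gabriella Lio", "address": "1709 Tree Drive, Austin TX 78745", "dob": "10-30-1990"}),
]

def golden_record(recs):
    merged = {}
    for field in recs[0][1]:
        # tally each candidate value and the best (lowest) priority seen for it
        best = {}
        for prio, r in recs:
            cnt, p = best.get(r[field], (0, prio))
            best[r[field]] = (cnt + 1, min(p, prio))
        # survivorship rule: most frequent value; ties broken by source priority
        merged[field] = max(best, key=lambda v: (best[v][0], -best[v][1]))
    return merged

print(golden_record(sources))
```

Here the majority agrees on the address and date of birth, and the priority tie-break settles the three-way disagreement on the name.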
13. Data Governance in the Age of AI

SAVES TIME
• When building a predictive model, data scientists spend most of their time cleaning and identifying data to use
• Profiling the data

GARBAGE IN, GARBAGE OUT
• The worse the quality of the data you train with, the worse the result of the AI
• AI projects shouldn't be started until you know you have good data
• Good data in, great decisions out

ETHICAL AI
• Privacy: AI systems must comply with privacy laws that require transparency about the collection, use, and storage of data
• Fairness: Minimizing bias in our data
15. WHY DATA SCIENCE PROJECTS FAIL
16. Lack of Business Readiness
• Organizations often lack the necessary analytic team structure to:
  1. Best enable a data-driven culture
  2. Realize the full potential, and ROI, of analytical capabilities
• Companies rarely lack data, tools, or technologies; this is more of a people and process issue
• Purposefully choosing an organizational strategy is one of the first and foremost decisions an analytics leader can make
(Diagram: People, Process, Technology)
17. Organizational Data Science Strategies

Decentralized
Benefits
• Subject matter expertise quickly available/accessible
• Analytics functions and teams are closely aligned to business, issues, and customers
Challenges
• Redundancy in physical resources and talent
• Inconsistency in process, results, and tools
• Focus on local issues
• No standardization and not leveraging scale

Semi-centralized
Benefits
• Shared services, processes, tools, and methodologies
• On-demand provisioning and better cost control
• Continuous improvement is likely, as efforts are focused on iteratively improving a core business
Challenges
• Less transparent allocation of resources among different initiatives
• Tends to bias certain business units
• Difficulty in cross-functional alignment and consensus

Centralized
Benefits
• Shared services, processes, tools, and methodologies
• On-demand provisioning and better cost control
• Best positioned for long-term innovation and value by being removed from the day-to-day fires of business units
Challenges
• Requires CXO-level commitment and investment to empower fast and effective organizational adoption
• Business and subject matter expertise requires more effort, engagement, and evangelism to attain
18. Defining Achievable Use Cases in 3 Steps

List out potential use cases
• A question that can be answered using data
• You may be looking for an answer, an explanation, or just validation
• Steer away from bias towards things only YOU know about, and from writing off things people think are too hard or impossible

Evaluate each use case
• Level of effort / technical feasibility
• Business value

Prioritize use cases
• Low level of effort / high technical feasibility coupled with high business value is a good place to start
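The evaluate-and-prioritize steps above reduce to a simple scoring pass. The use cases and scores below are hypothetical; the point is the quadrant logic, where high business value coupled with high technical feasibility comes first:

```python
# Use-case prioritization sketch: a code version of the value vs.
# feasibility grid. Names, scores, and thresholds are made-up examples.

use_cases = [
    # (name, business_value 1-10, technical_feasibility 1-10)
    ("Churn early-warning model",       9, 8),
    ("Fully autonomous pricing engine", 9, 2),
    ("Report automation",               4, 9),
]

def prioritize(cases, min_value=7, min_feasibility=7):
    # quick wins: the top-right quadrant (high value AND high feasibility)
    quick_wins = [c for c in cases if c[1] >= min_value and c[2] >= min_feasibility]
    # everything else is ordered by combined score for later iterations
    rest = sorted((c for c in cases if c not in quick_wins),
                  key=lambda c: c[1] + c[2], reverse=True)
    return quick_wins, rest

wins, later = prioritize(use_cases)
print([c[0] for c in wins])   # tackle these first for an early success
```

Re-running the same pass after each delivery cycle matches the iterative re-prioritization the talk recommends.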
20. WHY DATA SCIENCE PROJECTS FAIL
21. Building vs. Scaling Machine Learning
COMMON TOOLS USED. Building: Scikit-Learn, Pandas, Jupyter, local environment. Scaling: MLflow, MLlib, Spark, IDEs, DVC, cloud environment.
MODEL TRAINING AND PREDICTION. Building: managed by data scientists. Scaling: automatically orchestrated.
DEPLOYED. Building: not deployed. Scaling: deployed in production.
MODEL VALIDATION. Building: manual. Scaling: automated.
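The manual vs. automated model-validation row can be illustrated with a minimal promotion gate. This is a hedged sketch in plain Python with stand-in callables rather than real models; a scaled system would run a gate like this automatically after every training run:

```python
# Automated model-validation sketch: deploy a newly trained model only
# if it beats the production model on a holdout set. The "models" here
# are toy threshold functions, purely for illustration.

holdout = [(x, x > 5) for x in range(10)]   # (feature, label) pairs

current_model = lambda x: x > 6             # model now in production
candidate_model = lambda x: x > 5           # newly trained candidate

def accuracy(model, data):
    return sum(model(x) == y for x, y in data) / len(data)

def promote(candidate, current, data, margin=0.0):
    # automated gate: promote only on a measurable improvement
    return accuracy(candidate, data) > accuracy(current, data) + margin

print(promote(candidate_model, current_model, holdout))
```

The same comparison a data scientist would eyeball in a notebook becomes a repeatable, orchestrated check.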
22. What do we need to achieve Operationalization?

Storage
• The volume of data is growing
• Need somewhere to put all this data

Robust Data
• Need data from different sources (e.g., CRM, ERP, spreadsheets)
• Across the business (e.g., HR, Finance, Customer)
• Historical
• Readily available

Compute
• High-performing data processing
• Processing power to drive out our analysis

Output
• Communicating findings
• Graphs/charts
• Presentations

Model Deployment
• Testing
• Automated deployment
• Ethics in AI: trusted model, fair model

Model Management
• Statistical process control
• Data drift and model drift
• Stale models
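The statistical process control idea under Model Management can be sketched with 3-sigma control limits on a feature's mean. The data and threshold are illustrative; production systems would typically use richer drift tests, but the control-chart logic is the same:

```python
# Data-drift sketch using statistical process control: flag a live batch
# whose mean falls outside 3-sigma limits computed from training data.
import statistics

train = [10.0, 10.2, 9.8, 10.1, 9.9, 10.0, 10.3, 9.7]

def control_limits(values, sigmas=3):
    mu = statistics.mean(values)
    sd = statistics.stdev(values)
    return mu - sigmas * sd, mu + sigmas * sd

def drifted(batch, limits):
    lo, hi = limits
    return not lo <= statistics.mean(batch) <= hi

limits = control_limits(train)
print(drifted([10.1, 9.9, 10.0], limits))   # in control
print(drifted([14.0, 13.5, 14.2], limits))  # drifted: time to retrain
```

A drift alarm like this is what turns "stale models" from a silent failure into a scheduled retraining trigger.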
23. Technology Stack is Up-To-Date

DATA WAREHOUSES
Highly scalable, managed cloud data warehouses enable you to store TBs of data with just a few lines of SQL and no infrastructure. On-demand pricing means the technology is affordable for everyone, with only a few minutes of setup time.
Examples: Amazon Redshift, Google BigQuery, Snowflake, Azure Synapse

DATA PIPELINES
Ensure you have the fuel to power your warehouse and tools; without data, you have nothing to analyze. Especially important when giving real-time predictions and analysis on streaming data.
Examples: Apache Kafka, Apache Airflow, Confluent, Spark, Python, REST APIs

ANALYTICAL TOOLS
You need a framework for the entire life cycle of a data science project. The platform contains all the tools required for executing the lifecycle of the data science project across its different phases.
Examples: Python, R, Apache Spark, Anaconda, Databricks, H2O.ai, Alteryx, Domino

VISUALIZATIONS
In the world of Big Data, data visualization tools and technologies are essential to analyze massive amounts of information and make data-driven decisions.
Examples: Matplotlib, Tableau, Power BI, Plotly, D3, QlikView
24. Collaboration between Data Scientists & Data Engineers
• Data Engineering involves collecting relevant data, then moving and transforming it through "pipelines" for the Data Science team.
• Data Scientists analyze, test, aggregate, and optimize the data, and present it for the company.
• Some companies with advanced processes round out their teams with AI Engineers, Machine Learning Engineers, or Deep Learning Engineers.

It becomes clear that these tasks have to be divided among specific data professionals.
25. Collaboration between Data Scientists & Data Engineers
(Chart: data engineering skills vs. analytical skills, showing where Data Engineers and Data Scientists overlap)
• A data engineering resource can do some basic to intermediate-level analytics but will be hard pressed to do the advanced analytics that a data scientist does.
• Having a data scientist create a data pipeline is at the far edge of their skills but is the bread and butter of a data engineering resource.
• The two roles are complementary, with data engineering resources supporting the work of data scientists; data scientists and data engineering resources overlap on engineering and analysis.
26. 3 Key Takeaways: What do you do when you notice…

Data is inaccurate, siloed, and slow? Implement Data Governance, which will enable Data Quality and Master Data Management.

There is a lack of business readiness? Create an organizational strategy for data science that works for your company, and prioritize use cases iteratively.

Operationalization is unreachable? Realize the difference between building and scaling machine learning models, update your technology stack, and make sure data scientists collaborate with data engineering resources.
27. Survey the Audience
Discovering Project Do's and Don'ts
28. When designing a solution, is your team more focused on…
Designing the 'supreme' solution, or beginning on the solution early, being agile, and starting small?
29. What is the average timeline for deliverables on data science projects you have been a part of?
Timelines that deliver on weekly scales, or timelines that deliver on monthly scales?
30. When engaging in a project, is your team...
Hyper-focused on the business problem, or hyper-focused on the solution?
31. PROJECT DO'S AND DON'TS

DO:
• Begin early, be agile, and start small
• Timelines that deliver on weekly scales
• Aim for "good enough" and adding business value
• 4-6 person teams
• Hyper-focused on the business problem
• Co-developing with SMEs and stakeholders
• Focus on a fast mover strategy

DON'T:
• Designing the 'supreme' solution
• Timelines that deliver on monthly scales
• Aim for perfect accuracy
• Large, slow-moving teams
• Hyper-focused on the solution
• Developing in silos
• Focus on a first mover strategy
32. THE JOURNEY TO AI ADOPTION
(Chart: stages plotted by business readiness vs. technical capability)

Experimentation: Business leaders are exploring the landscape, talking to vendors, etc.
Clean Data: Data is reliable and accurate for deep analysis and modeling.
Established Data Governance: Accountable and consistent standards are implemented.
Proof of Value: Real and measurable prototypes are scoped and built for technical understanding and business value.
Modern Data Architecture: Data is no longer slow or siloed, thanks to next-gen technology stacks and business stakeholder buy-in.
Scalable Machine Learning: Teams, technologies, and techniques are highly efficient at building, deploying, and managing data pipelines across the enterprise.
AI Adoption: AI has been seamlessly integrated into enterprise processes and technologies.
Good afternoon. I want to start by thanking everyone for joining us today. My name is Gaby Lio, and I am a data scientist at Sense Corp. We have worked with multiple Fortune 500 companies, sharing and implementing data-driven solutions, and I have plenty of scar tissue around why data science projects can succeed and why they can also fail, so I'm excited to be speaking with you all today. Let's dive right in.
Before we dive into why data science projects are failing, I want to start by looking at the current state of data science and how rapidly AI adoption is spreading across all industries and roles, to paint a better picture of why it matters that these projects succeed. The Anaconda State of Data Science report, a survey of the Anaconda community, painted an interesting picture of the types of jobs held by data science learners, and the results showed adoption across every role. You can see a revolution is happening, with interest in data science spanning a very broad range of job functions. This signals that these professionals are increasing their data literacy and will be able to adapt to a data-driven business model where machine learning is incorporated into their day-to-day work. They are ready for it, so why isn't this adoption spreading faster and being implemented across every organization today?
The answer is that data science projects are failing at an alarming rate. Depending on who you ask, most industry surveys will cite that nearly 9 out of 10 data science projects fail, and we can attribute this failure to three specific causes.
The first factor revolves around your data. Having your data in silos prevents employees across the organization from accessing a shared source of data, while inaccurate data can lead to inaccurate decision making and eventually a loss in revenue. Furthermore, if the speed at which your data is ingested and made available to you is slow, real-time analytics will never be an option. Therefore, successful data science initiatives rely on aligning data quality, master data management, and data governance to ensure all three are integrated and fully working together to prevent inaccurate, siloed, and slow data.
The second factor is a lack of business readiness. There is often no honest understanding of the requirements and capabilities needed to take on data science initiatives. We will tackle the people and process side of business readiness by touching on how to set up your data science team within your organization and how other teams should interact with data scientists. Then we'll take a deep dive into defining achievable use cases that can be easy wins for you and your team.
The last factor contributing to why data science projects fail is centered around operationalization being unreachable. In order to set your team up for success, your company should be investing in business modernization, specifically around making sure the technology stack is up to date and that data pipelines and processes are scalable. There should also be a clear distinction between roles on the team, with data scientists and data engineers working together to create and push models into production.
I will step through each of these in greater detail, giving you solutions to prevent these common pitfalls.
Let's start by addressing the issue of data.
To better understand why clean data is so important, I am going to be relating clean water to clean data throughout this section.
In our developed world, we take clean water for granted. We simply have to turn the tap on, pour a glass, and drink the water. But it hasn't always been that way, and it wasn't a simple process that got us there.
We developed technologies such as aqueducts, filters, and water treatment facilities to create and deliver clean water, and now it's a standard. So why haven't we created the standard that our data should be clean? We continue to struggle with clean data because many companies lack the required level of rigor and investment in processes, technologies, and resources to deliver it. We know that dirty water can impact people's health, yet we don't easily accept or recognize the impact that dirty data can have on an organization.
So how do we get clean data that is available to all who need it across the organization? It’s a process that begins with Data Governance, incorporates Data Quality, and finally leverages Master Data Management. Most companies only focus on one or some of these efforts without coupling them together.
While water can freely roll downhill, data needs to be transported downstream, and it requires a defined and concerted effort to end up with clean data. Ensuring these three disciplines are aligned organizationally, fully integrated, and working together is the key to success.
Let's start with Data Governance. At Sense Corp we define Data Governance as "the exercise of authority and control (planning, monitoring, and enforcement) over the management of data assets."
The framework you see on the screen here represents the various categories that must be considered in order to make any governance effort successful. (read out all of them)
But probably the best way to understand governance is through a real-life example of something that happened back in 1969 in Cleveland, Ohio.
For decades in the first half of the 20th century, industrial waste and sewage regularly poured into the Cuyahoga (KAI-A-HOGA) River, and residents accepted it as a consequence of the city's prosperity. But in the 1960s, mindsets started to shift as the population became more environmentally conscious. In the next decade, citizens demanded that governance over our natural resources be enacted.
How did they do this?
After decades of river fires that would burn bridges, boats, and buildings along the shore, citizens demanded change.
The Cleveland mayor (acting as a voice of leadership) testified before the US Congress. This led to the formation of the Environmental Protection Agency (EPA), which in part led to the passage of the Clean Water Act.
From a governance perspective, governing bodies were created with the authority to tackle the problem. The Clean Water Act was a statute that called for policies and standards. The clean-up was funded through local bonds and federal monies. If you think about the people, the agencies, and the controls put in place, this is what governance looks like. And these are the concepts we apply to data today to ensure that our data lakes, rivers, and streams stay clean and usable for everyone.
*** Data Governance is not a project or a program; it’s a core business function that is necessary in order to compete in the 21st-century business climate.***
Just as water needs to go through a comprehensive set of water quality checks before being consumed, data needs to go through data quality checks before being used.
There are six key dimensions in which Data Quality should be assessed. The first is completeness: is all the data available? What about consistency: can we match data across sources or datasets? We need to look at uniqueness: is there a single definition of that data? What about validity: does the data match the rules? You can't have someone in the system whose age is 200; we know that's not possible in the real world, so why should it be allowed in your systems? Then there's accuracy: is the data correct? And lastly, timeliness: is the data available when needed? All six coupled together make up your Data Quality.
And many different issues contribute to the quality of your data. Some key contributors are source system issues, data input errors, redundant data, inconsistent usage, and a lack of data auditing, all of which can be improved upon with policies and processes set forth in Data Governance. So you can see how it is all interconnected.
Furthermore, today you will often see Data Quality lumped in with Master Data Management, because many MDM technology offerings provide data profiling and data quality tools as part of their product; but Data Quality is indeed considered a separate discipline. So what specifically is Master Data Management, and where does it differ from data quality?
It is a technology-driven discipline that allows companies to accurately combine data from multiple data sources; it is used to create the master definition for data domains and to drive consistent use of high-integrity data across the company.
Imagine all of the different data sources used at your company to bring data in. You can have data from an ERP system, from a CRM system, and maybe even a claims system, all representing a single customer in three different places. With data being captured in different ways, there are inevitably going to be some differences: maybe the person has recently moved, so their address differs across systems, or maybe they go by a nickname they put in one system but not the others. MDM is the process of applying rules to determine the golden record and ensure alignment around common use of data.
And to bring it full circle, Data Quality and MDM can only be successful when operating under a well-implemented Data Governance program.
So why is Data Governance so important in the Age of AI?
First, it saves time down the road. When building a predictive model, data scientists spend most of their time cleaning, identifying, and profiling the data they will use. Imagine having clean data, all accessible in one place, cataloged nicely and ready for you to use. The time savings would be tremendous.
Second, we've all heard this before: put garbage into your model and you will get garbage out. The worse the quality of the data you train with, the worse the results of your AI. AI projects shouldn't even be started until you know you have good data, as good data in leads to great decisions out.
And lastly, a big topic in the AI community right now is creating trust in our models and practicing ethical AI. With Data Governance in place, the privacy of the data used in these models, along with the fairness of the model, can be assured, as data governance aids in transparency around the collection, use, and storage of the data, as well as minimizing the bias in the data circulated across the organization.
So overall, bad data means bad everything. It affects the bottom line and your ability to make accurate decisions. 88% of companies report that inaccurate data has had a direct impact on their bottom line, with 12% reporting lost revenue because of inaccurate data, and 42% of managers admitting they have made wrong decisions using bad data. Think about the 1-10-100 rule of clean data: a $1 prevention cost at the point of capture turns into a $10 correction cost downstream if not caught, and balloons into a $100 failure cost at the time of the decision. So although it's a cheap cost upstream, downstream it compounds. The moral of the story is: put in the work upfront to make sure your data is clean and accessible for everyone in the organization.
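The 1-10-100 arithmetic works out like this; the 1,000-record count is a made-up example, while the per-stage dollar figures come from the rule of thumb itself:

```python
# 1-10-100 rule sketch: the cost of the same batch of bad records caught
# at capture, corrected downstream, or left to cause a bad decision.
bad_records = 1_000  # hypothetical batch size
cost_per_record = {"prevention": 1, "correction": 10, "failure": 100}

for stage, unit_cost in cost_per_record.items():
    print(f"{stage}: ${bad_records * unit_cost:,}")
```

The same batch of errors costs $1,000 to prevent, $10,000 to correct, and $100,000 if it reaches a decision, which is why the cheap upstream fix wins.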
Now we are going to look at how a lack of business readiness contributes to data science projects failing.
Whenever we think about a transformation, we think in terms of the people, process, and technology within it. In this transformation towards AI, though, we are seeing that companies rarely lack data, tools, or anything else that fits in the technology bucket. There is a plethora of data out there and many open-source tools available to start analyzing it. What most organizations lack sits in the people and process domains. Correctly structuring a data science team within your organization is a huge step an analytics leader needs to take to enable a data-driven culture and help the company realize the full potential of its analytical capabilities.
What's even more interesting is that setting up an organizational strategy for data science not only secures a place for data science to grow and flourish inside the organization, it also helps the teams around the data science team learn how to interact with data scientists, which they currently don't know how to do. Data scientists have very desirable skill sets: they know how to program, how to visualize and analyze data, and how to build predictive and statistical models. Because of their knowledge across multiple domains, they often get pinged and pulled to put out fires, and data science initiatives get thrown on the back burner instead of working through deliberate projects scoped out by the business teams.
Let's take a look at the three main types of data science strategies organizations are using to set up their data science teams for success.
The first is a decentralized strategy: think Finance vs. Sales vs. Product vs. Customer Success, each with its own analytics team dedicated to and embedded within the function. Some cons are that you will have to move and transform data between applications, potentially duplicate work, and operate in a more reactive manner, tackling problems only as they appear. The benefits are that it's easy to build subject matter expertise within each area, and the analytics functions are closely aligned with the business, its issues, and its customers. This setup commonly arises in larger organizations where data science initiatives have grown organically in multiple parts of the business.
Now let's jump to the other end of the spectrum and look at a centralized strategy: all quantitative analysts, data engineers, and data scientists report into a central analytics hierarchy, with responsibilities spanning the organization. This is very common, and you may have seen it branded as a COE, or center of excellence. Time and resources are managed within that unit to develop technical expertise and modeling capabilities, as opposed to minimizing the response time between business question and answer. It's a very proactive approach. The benefits are shared services, processes, tools, and methodologies, and being better positioned for long-term innovation. Centralized functions can work well in analytically mature organizations with the time, patience, and money to fund what is essentially an internal research capability. The cons are that it requires a large commitment and investment to empower fast and effective organizational adoption, and building subject matter expertise takes much more effort.
Lastly, we will look at what falls between these two ends of the spectrum: a semi-centralized strategy. Like a centralized structure, a single organizational data science leadership team sets the data science strategy, and its management team serves as functional managers who hire, develop, and promote data scientists. Sister (or embedded) teams of engineers enable production deployment. However, the data scientists are assigned to (and might even sit with) various business units and focus on domain-specific problems. Breadth of knowledge can be gained by rotating data scientists among the various centralized sub-teams. In short, the organization gets a centralized infrastructure, a common data science strategy, and effective talent management, and the business units get somewhat dedicated teams who are knowledgeable about their specific needs.
Every organization is at a different point in the journey, so there is no right or wrong answer when setting up your data science organizational strategy. The key is to pick a strategy and educate the organization on how to adapt to it.
The other aspect folded into a lack of business readiness is making sure you are defining achievable use cases for your data science teams. This happens in three simple steps. First, list out all the potential use cases. This is the easiest part; the only guideline is that each has to be a question that can be answered using data, and it doesn't necessarily need a straightforward answer either: you may be looking for an explanation or validation. I want to caution you, when thinking of these use cases, to steer away from things only YOU know about or things YOU may think are impossible. Treat it like a brainstorming session: throw everything out there and see what sticks, and make sure team members from diverse backgrounds are in these discussions, instead of just people from one business unit or area of expertise. Next is evaluating the use cases. I'll show you a blown-up example of this on the next slide, but think of creating a graph with an x and a y axis: on the x axis you have business value, and on the y axis you have the level of effort or technical feasibility. Take the use cases and plot them on the graph to see where they fall. Visualizing them this way makes it really easy to carry out our last step, prioritizing the use cases. Now you can see the ones you should tackle first: those occupying the high-technical-feasibility, high-business-value space.
So those in the top right corner are the use cases we drove out first to give us a quick win. To enable data science across the organization, it is better to start with something small that drives business value than to aim really high and fail; otherwise you give the perception across the organization that data science projects are risky, take a long time to complete, and aren't even successful. By aiming for the more attainable use cases, you are showing success to get the ball rolling, all the while still developing your talent and investing in technologies so that down the line you will be ready to tackle the bigger ones highlighted in red. It's also really important to note that this isn't static either: as you invest in new technologies and your talent grows, you can always add more use cases and then re-evaluate and re-prioritize according to your current business climate. It is an ever-changing cycle that must be iterated upon.
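The scoring-and-prioritization steps above can be sketched in a few lines of code. This is a minimal illustration, assuming hypothetical use cases and made-up 1-to-5 scores for business value and technical feasibility; any real evaluation would come from your own stakeholders:

```python
# Hypothetical use cases scored 1-5 on business value and technical feasibility.
use_cases = [
    {"name": "Churn prediction", "value": 5, "feasibility": 4},
    {"name": "Demand forecasting", "value": 4, "feasibility": 5},
    {"name": "Real-time fraud detection", "value": 5, "feasibility": 2},
    {"name": "Report automation", "value": 2, "feasibility": 5},
]

# Quick wins occupy the high-value, high-feasibility quadrant (top right).
quick_wins = [u for u in use_cases if u["value"] >= 4 and u["feasibility"] >= 4]

# Prioritize by combined score so the easiest high-impact work comes first.
ranked = sorted(use_cases, key=lambda u: u["value"] + u["feasibility"], reverse=True)

print([u["name"] for u in quick_wins])
```

As talent and technology mature, you would re-score and re-run the ranking, which mirrors the iterative cycle described above.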
The number one factor making operationalization unreachable is the failure to recognize that building machine learning and scaling machine learning are two different sets of problems, each with its own set of solutions. This plays a key role in why data science projects fail. A lot of companies are just aiming to build models, which is a great place to start, but if you want your data science projects to be successful for the long term and integrated into the business, you need to make sure that once the models are built, they can be scaled.
Think about when you are building a model: you are normally running it on your computer in a Jupyter notebook. What happens when these models need to go into production and run in real time? What you were building on your local machine will almost surely break when scaled into production. Models in production should run automatically, on a platform with serious processing power, and they should be checked regularly through an automated process for model drift or staleness. These are considerations you didn't even need to think about when building the model on your machine, because there you weren't deploying anything; models were run on command and only validated against other models manually.
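One common way to automate the drift check mentioned above is the Population Stability Index (PSI), which compares the distribution a model was trained on against what it sees in production. The sketch below is illustrative only, using synthetic data and the commonly cited (but not universal) rule of thumb that PSI above 0.2 signals meaningful drift:

```python
import numpy as np

def psi(expected, actual, bins=10):
    """Population Stability Index between a baseline and a production distribution."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # Floor tiny bins to avoid division by zero and log(0).
    e_pct = np.clip(e_pct, 1e-6, None)
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

rng = np.random.default_rng(0)
train_scores = rng.normal(0.0, 1.0, 5000)  # distribution at training time
prod_scores = rng.normal(1.0, 1.0, 5000)   # shifted production distribution

# Rule of thumb: PSI > 0.2 suggests significant drift worth investigating.
drifted = psi(train_scores, prod_scores) > 0.2
print(drifted)
```

In production, a check like this would run on a schedule against live scoring data and trigger an alert or a retraining job, rather than being eyeballed manually.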
Creating and carrying out a plan to transition the models you built into production is vital if you want the project to succeed.
But this isn't the only arena operationalization is composed of. There is a process side and a technology side. The process side deals with model deployment and model management, but in order to drive that out you need to invest in the proper technology. Before we dive into which specific technologies you should be investing in, let's take a step back and first understand, at a high level, the big buckets we need to think about in order to achieve operationalization from a technological standpoint.
Storage is the first bucket. Everybody knows the volume of data is growing at a compounding rate; worldwide data is expected to hit 175 zettabytes by 2025! So we need somewhere to store all this structured and unstructured data, preferably in a space that has room to grow.
Second, as obvious as it sounds, we need robust data. As we learned earlier, data cannot be siloed, inaccurate, or slow, so we need the proper processes in place to bring data in from multiple systems across the business, even dating back to prior years, and to make sure that data is readily available and easy to access.
Next is compute. Training models on millions of rows of data is no easy task for your computer, and when these models are running in production you need them to be fast, giving real-time results. Processing power is therefore very important and should be a key factor when evaluating the technologies you will be adopting.
Lastly, the output of your analysis should be taken into consideration. Think about how you want to communicate your findings. Are you going to display a bunch of code to your project stakeholders to convince them your model should be used to make decisions? Not likely. So investing in a tool that can help you visualize your findings is just as important as the other three buckets.
Now, using the four big buckets we just outlined, let's walk through the types of technologies that fit into each category. For storage, you are going to want to invest in a data warehouse that is highly scalable and in the cloud. You get on-demand pricing that is affordable for everyone, minimal setup time, and you don't have to worry about managing the database infrastructure. Examples of these warehouses are tools like Amazon Redshift, Google BigQuery, Snowflake, and Azure Synapse.
For achieving robust data, you are going to want to ensure you have the proper data pipelines in place to bring your data to users across the organization in a timely manner. This powers your warehouses and is especially important when giving real-time predictions and analysis on streaming data. Examples of tools in this space are Apache Kafka, Airflow, Confluent, Spark, Python, and REST APIs.
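Whatever orchestrator you pick, the heart of a pipeline is a transform step that validates and standardizes incoming records. Below is a tool-agnostic sketch in plain Python, with hypothetical event fields invented for illustration; a production pipeline would run steps like this inside Airflow, Spark, or a Kafka consumer:

```python
from datetime import datetime

# Hypothetical raw events as they might arrive from a source system.
raw_events = [
    {"id": "1", "amount": "19.99", "ts": "2024-01-05T10:00:00"},
    {"id": "2", "amount": "bad-data", "ts": "2024-01-05T10:01:00"},
    {"id": "3", "amount": "5.00", "ts": "2024-01-05T10:02:00"},
]

def transform(event):
    """Cast types and standardize fields; return None for unparseable rows."""
    try:
        return {
            "id": int(event["id"]),
            "amount": float(event["amount"]),
            "ts": datetime.fromisoformat(event["ts"]),
        }
    except (ValueError, KeyError):
        return None  # a real pipeline would route this to a dead-letter queue

# Keep only the rows that parsed cleanly.
clean = [t for e in raw_events if (t := transform(e)) is not None]
print(len(clean))
```

The point of the sketch is the contract: bad rows are caught at the boundary and quarantined, so downstream models and dashboards only ever see typed, validated data.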
Once the data is available to us for modeling, we need analytical tools or platforms to help us process it and train or build our models. These tools can even be looked at as a framework for the entire life cycle of a data science project. These would be tools like Python, R, Spark, Anaconda, Databricks, Alteryx, or Domino.
Lastly is how we want to communicate our findings, and visualization tools are the main players in this arena, aiding business stakeholders in making decisions. You should be looking at Tableau, Power BI, Plotly, D3, and Matplotlib.
One last point: these are the core tools, but definitely not an all-inclusive list. There are other types of data you may be bringing in, such as video files, text files, and geodatabase files, and other storage options such as NoSQL stores and graph databases.
So we've touched on the process and the technology aspects of operationalization, but what about the people? I want to call out how important it is to make sure your data scientists are working with data engineering resources to achieve success. As AI continues to evolve, so do the roles that come with implementing data science initiatives. Data engineering is used to collect the relevant data and build pipelines to move and transform the data, making it available to the data science team. In smaller organizations this role can sometimes be filled by data scientists; in larger organizations you may see a dedicated data engineering resource with a software engineering background, or the role may be fulfilled by the IT department.
The distinction here is that data scientists may still have to transform the data to fit their models, but they are mainly analyzing the data using statistical methods to draw insights, leaving the data engineering to other resources who are experienced in that arena.
But although they are distinct roles, data engineering resources must work closely with data scientists to streamline capabilities. Asking a data scientist to build a data pipeline is at the far edge of their skills, while it is the bread and butter of a data engineer. Data engineers use their programming and systems-building skills to create big data pipelines, while data scientists use their more limited programming skills and apply their advanced math skills to create data products on top of those existing pipelines. This difference between creating and using lies at the core of a team's failure with big data. A team that expects its data scientists to create the data pipelines will be gravely disappointed.
A special peek into our upcoming webinar, Small Investments, Big Returns: Three Successful Data Science Use Cases, which will be September 17, so be on the lookout. It will go over multiple client use cases where we have come in and helped them at a specific part of their journey or throughout its entirety. No two journeys are alike, and neither are the timelines of climbing toward full AI adoption. The projects range from the manufacturing industry to the oil and gas industry, and even to the education industry. You won't want to miss it.
I very much appreciate your time today, and I look forward to connecting with you all again in the future. If you have any questions, please feel free to ask them now and Kelly will help facilitate them.