With recent studies indicating that 80% of AI and machine learning projects are stalling because of data quality issues, it’s critical to think holistically about data quality. This is not a simple topic: data quality issues can arise at any point, from project kickoff through model implementation and use.
View this webinar on-demand, where we start with four foundational data steps to get our AI and ML projects grounded and underway, specifically:
• Framing the business problem
• Identifying the “right” data to collect and work with
• Establishing baselines of data quality through data profiling and business rules
• Assessing fitness for purpose for training and evaluating the subsequent models and algorithms
Your AI and ML Projects Are Failing – Key Steps to Get Them Back on Track
1. Your AI and ML Projects Are Failing
Key Steps to Get Them Back on Track
Harald Smith, Director Product Marketing
3. Speaker
Harald Smith
• Director of Product Marketing, Syncsort
• 20+ years in Information Management with a focus on data quality, integration, and governance
• Co-author of Patterns of Information Management
• Author of two Redbooks on Information Governance and Data Integration
• Blog author: “Data Democratized”
4. AI/ML needs Data Quality
The importance of data quality in the enterprise:
• Decision making
• Customer centricity
• Compliance
• Machine learning & AI
Only 35% of senior executives have a high level of trust in the accuracy of their Big Data Analytics
KPMG 2016 Global CEO Outlook
92% of executives are concerned about the negative impact of data and analytics on corporate reputation
KPMG 2017 Global CEO Outlook
80% of AI/ML projects are stalling due to poor data quality
Dimensional Research, 2019
“Societal trust in business is arguably at an all-time low and, in a world increasingly driven by data and technology, reputations and brands are ever harder to protect.”
EY “Trust in Data and Why it Matters”, 2017
5. “The magic of machine learning is that you build a statistical model based on the most valid dataset for the domain of interest. If the data is junk, then you’ll be building a junk model that will not be able to do its job.”
James Kobielus, SiliconANGLE Wikibon, Lead Analyst for Data Science, Deep Learning, App Development, 2018
6. Key steps to improve Data Quality for AI/ML
Four foundational data steps to get or keep your AI and ML projects grounded and underway:
1. Frame the business problem
2. Identify the “right” data to collect and work with
3. Establish baselines of data quality through data profiling and business rules
4. Assess and communicate the fitness for purpose of the data for training and evaluating the subsequent models and algorithms
9. Universal DQ best practices
Understand the End Goal
• How does the business intend to use the data (i.e., what’s the use case)?
• Empower users (“Who”) to gain new clarity into the core problem (“Why”)
• What will the data be used for?
• What defines the Fitness for your Purpose?
Establish Scope
• Ask the “right questions” about the use case and the data (not just “what” and “how”)
• What data is relevant to the effort?
• Big Data or other, you need to set boundaries for the work
Understand Context
• How does the business define the data?
• What are the important characteristics and context of the data?
• What are the Critical Data Elements?
• What qualities will you need to address, or leave alone?
• “High-quality data” definition will vary by business problem
“If you don’t know what you want to get out of the data, how can you know what data you need – and what insight you’re looking for?”
Wolf Ruzicka, Chairman of the Board at EastBanc Technologies, Blog post: June 1, 2017, “Grow A Data Tree Out Of The ‘Big Data’ Swamp”
“Never lead with a data set; lead with a question.”
Anthony Scriffignano, Chief Data Scientist, Dun & Bradstreet, Forbes Insights, May 31, 2017, “The Data Differentiator”
11. What’s the “Right” Data?
Is relevant and specific for the business problem
Is free from bias and assumptions
Supports hypothesis testing
Ask questions about the data you expect you need
Understand the Provenance of the data
• Who produced it, when did they produce it, and why?
• Has it been transformed or changed from the original (lineage)?
Understand whether the data is Comprehensive
• What is the scope of the data?
• What data is missing?
• Are approaches available to identify/capture what is missing?
Understand the “universe” of Relevant data
• Consider sources within and outside the organization
Understand whether the data is Timely
• How can you be certain the data is truly current?
Understand additional challenges obtaining data, both for evaluation and operational use
12. Comprehensive: a “Customer” example
Comprehensiveness depends on the business context/question
• Customer Engagement/Loyalty: known customers, both active & inactive
• New Customer Campaigns: “active” consumers, both known and unknown
• Fraud Detection: any known or unknown person impersonating a customer or prospect
Ask/understand what the “Unknown and/or Unavailable” represents
• Why does this segment exist?
• If relevant, can the characteristics be inferred through other data?
• Is there inherent bias in leaving this group out?
Known & Active
• Customer
• Data in MDM/DW
• What about Call Center? CRM? Website visits? Store visits? Loyalty Program?
Known & Inactive
• Former Customer
• Data in MDM/DW?
• What about Call Center? CRM? Website visits? Store visits? Loyalty Program?
Unknown & Active
• Prospect
• Data in CRM? Website visits? Store visits? Prospect lists?
Unknown and/or unavailable
• Not a customer
• No data? Or is data available through other means?
13. Relevant: a “Customer” example
Relevance for additional data depends on the business context/questions
• Customer Engagement/Loyalty: Website, Call Center, Social Media, Location, Store Data, Demographics
• New Customer Campaigns: Location, Demographics, Website, Social Media, Prospect Lists
• Optimal Shipping/Delivery: Location, Weather, Store Data
• Fraud Detection: IP Address, Device ID, Purchase Location, etc.
Additional content from both internal and external sources may be relevant if within a useful time period
• Change of Address, Suppression lists, etc.
[Diagram: “Customer” at the center, linked to Location; Demographics; Social Media; Website, Call Center, Store, etc.; Other (Weather, Prospect Lists, etc.); Order Transactions; Call Transcriptions; Product/Service Reviews; Abandoned Carts; Census Data; Credit History]
14. Five further challenges to enable Machine Learning
1. Lack of data, or scattered and difficult to access datasets
• Little or no accessible data; or necessary data trapped in mainframes, operational systems, or streams.
• Data typically stored in incompatible formats.
• Other data must be acquired, appended, or transformed for use.
2. Data standardization, cleansing, and enrichment at scale
• Data needs to be tagged, classified, standardized, and normalized.
• Data quality standardization, cleansing, enrichment, and preparation need to be applied consistently and reproduced at scale.
3. Entity resolution and customer identification
• Distinguishing single entity matches across massive datasets requires sophisticated multi-pass, multi-field matching algorithms.
• Continuous cross-comparison and resolution need to occur as new data arrives.
4. Need for near real-time current data
• Tracking and detection need to happen very rapidly.
• Current transactions must be constantly added to combined datasets and presented to models as close to real time as possible.
5. Tracking lineage from the source
• Data changes made to help train models have to be exactly duplicated in production.
• Capture of complete lineage, from source to end point, is needed.
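As a rough illustration of the multi-pass, multi-field matching mentioned under challenge 3, the following minimal sketch groups records that agree on any one of several field combinations. The field names, the two match passes, and the exact-match-after-normalization approach are all assumptions made for this example; production entity resolution typically relies on fuzzy matching and far richer rules.

```python
from collections import defaultdict

def normalize(value):
    """Lowercase and keep only letters and digits."""
    return "".join(ch for ch in value.lower() if ch.isalnum())

# Each pass lists fields that must all agree after normalization.
# These passes are illustrative assumptions, not a recommended rule set.
MATCH_PASSES = [
    ("email",),
    ("last_name", "postal_code", "phone"),
]

def resolve_entities(records):
    """Group record indexes that match on any pass (simple union-find)."""
    parent = list(range(len(records)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for fields in MATCH_PASSES:
        first_seen = {}
        for idx, rec in enumerate(records):
            key = tuple(normalize(rec.get(f, "")) for f in fields)
            if not all(key):  # skip records missing any field in this pass
                continue
            if key in first_seen:
                parent[find(idx)] = find(first_seen[key])
            else:
                first_seen[key] = idx

    groups = defaultdict(list)
    for idx in range(len(records)):
        groups[find(idx)].append(idx)
    return sorted(groups.values())

# Hypothetical sample records
records = [
    {"email": "Ann.Lee@example.com", "last_name": "Lee", "postal_code": "02110", "phone": "555-0101"},
    {"email": "", "last_name": "LEE", "postal_code": "02110", "phone": "(555) 0101"},
    {"email": "ann.lee@example.com", "last_name": "Lee-Smith", "postal_code": "10001", "phone": "555-0199"},
    {"email": "bob@example.com", "last_name": "Ray", "postal_code": "60601", "phone": "555-0142"},
]
entity_groups = resolve_entities(records)
print(entity_groups)
```

Here the first three records end up in one group even though no single pass links all of them: the email pass connects records 0 and 2, and the name/postal code/phone pass connects records 0 and 1. This is also where over-matching can creep in, so each pass deserves review.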
16. Data Quality challenges with Machine Learning
Incorrect, incomplete, mis-formatted, and sparse “dirty data”
• Mistakes and errors are rarely the patterns you are looking for.
• Sparse data generates other issues or may be ignored as “noise”.
• Correcting and standardizing data boosts the signal, but can increase bias.
Missing context
• Insufficient information about customer and location data can make many ML algorithms unusable.
• Enriching data increases context, but the choice of source can skew or bias the result.
Duplicates and multiple copies
• Many sources can yield multiple records about the same person, company, product, or other entity, skewing the signal and outcomes.
• Removing duplicates enhances the overall depth and accuracy of information about a single entity, but watch for over- or under-matching of data.
Spurious correlations
• Inclusion of already correlated data (e.g., city and postal code) may result in overfitting of ML algorithms or ‘false’ discoveries.
Traditional data quality processes are an effective method to identify defects.
Correcting data problems vastly increases a data set’s usefulness for machine learning.
CAUTION: Data analysts may not be aware of specific data quality issues that must be addressed to support machine learning.
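One way to spot already correlated fields such as city and postal code before they reach a model is to measure how often one field determines another. The sketch below is a simple assumption-laden check, not a statistical test; the function name, field names, and sample rows are hypothetical.

```python
from collections import defaultdict

def determined_fraction(rows, source, target):
    """Fraction of rows whose `target` equals the most common `target`
    seen for their `source` value. 1.0 means `source` fully determines
    `target` in this sample, so one of the two fields may be redundant."""
    if not rows:
        raise ValueError("rows must not be empty")
    counts = defaultdict(lambda: defaultdict(int))
    for row in rows:
        counts[row[source]][row[target]] += 1
    agreeing = sum(max(targets.values()) for targets in counts.values())
    return agreeing / len(rows)

# Hypothetical sample rows
rows = [
    {"postal_code": "02110", "city": "Boston", "segment": "A"},
    {"postal_code": "02110", "city": "Boston", "segment": "B"},
    {"postal_code": "10001", "city": "New York", "segment": "A"},
    {"postal_code": "10001", "city": "New York", "segment": "A"},
]
city_score = determined_fraction(rows, "postal_code", "city")
segment_score = determined_fraction(rows, "postal_code", "segment")
print(city_score, segment_score)
```

A score at or near 1.0 suggests the pair carries largely the same information; whether to drop one of the fields still depends on the business problem.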
17. Data Quality best practices
Understand Context
• What Critical Data Elements and other attributes are relevant?
• What qualities need to be addressed, or left alone?
• When, and where, do we need to transform or enrich the data content?
• How are we connecting, relating, or combining data?
Develop, Test, and Deploy Corrective Measures
• Consistent application of standardization, transformation, enrichment, and entity resolution
• Common templates, rules, metrics, and processes that can be leveraged
• Validation and measurement after corrective measures are applied
• Deploy into batch, real-time, or embedded services
Apply Data Governance
• Implement metrics and measures for ongoing assessment and evaluation
• Establish baselines for ongoing comparison/evaluation
• Continue to iterate throughout data preparation and model testing
18. Tools for DQ analysis
Data Profiling
The set of analytical techniques that evaluate actual data content (vs. metadata) to provide a complete view of each data element in a data source. Provides summarized inferences, and details of value and pattern frequencies to quickly gain data insights.
Business Rules
The data quality or validation rules that help ensure that data is “fit for use” in its intended operational and decision-making contexts. Assess the dimensions of data quality: accuracy, completeness, consistency, relevance, timeliness, & validity of data.
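To make “value and pattern frequencies” concrete, here is a minimal profiling sketch for a single column. The pattern notation (letters become “A”, digits become “9”) and the sample phone values are assumptions for illustration; dedicated profiling tools report far more than this.

```python
from collections import Counter

def pattern_of(value):
    """Map letters to 'A' and digits to '9'; keep other characters as-is."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("9")
        elif ch.isalpha():
            out.append("A")
        else:
            out.append(ch)
    return "".join(out)

def profile_column(values):
    """Summarize the actual content of one column (not its metadata)."""
    non_empty = [v for v in values if v not in (None, "")]
    return {
        "count": len(values),
        "populated": len(non_empty),
        "distinct": len(set(non_empty)),
        "top_values": Counter(non_empty).most_common(3),
        "patterns": Counter(pattern_of(v) for v in non_empty).most_common(),
    }

# Hypothetical phone column with mixed formats and a placeholder value
phones = ["555-0101", "555-0142", "5550199", "", "555-0101", "n/a"]
report = profile_column(phones)
for name, result in report.items():
    print(name, result)
```

The pattern counts surface the inconsistent formatting (“999-9999” vs. “9999999”) and the “n/a” placeholder, both of which are candidates for business rules.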
19. Common Data Quality measurements
What measures can we take advantage of?
1. Completeness – Are the relevant fields populated?
2. Integrity – Does the data maintain an internal structural integrity or a relational integrity across sources?
3. Uniqueness – Are keys or records unique?
4. Validity – Does the data have the correct values?
• Code and reference values
• Valid ranges
• Valid value combinations
5. Consistency – Is the data at consistent levels of aggregation, or does it have consistent valid values over time?
6. Timeliness – Did the data arrive in a time period that makes it useful or usable?
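Several of these measurements can be expressed as simple ratios. The sketch below computes completeness, uniqueness, validity, and timeliness for a handful of records; the field names, the reference values, the 0 to 120 age range, and the 30-day freshness window are all assumptions chosen for the example.

```python
from datetime import date

VALID_STATUS = {"active", "inactive"}  # hypothetical reference values

def measure_quality(rows, as_of, max_age_days=30):
    """Return each measurement as the fraction of rows that pass."""
    total = len(rows)
    ids = [r["customer_id"] for r in rows]
    return {
        # Completeness: is the relevant field populated?
        "completeness_email": sum(bool(r["email"]) for r in rows) / total,
        # Uniqueness: are keys unique?
        "uniqueness_id": len(set(ids)) / total,
        # Validity: code/reference values and valid ranges
        "validity_status": sum(r["status"] in VALID_STATUS for r in rows) / total,
        "validity_age": sum(0 <= r["age"] <= 120 for r in rows) / total,
        # Timeliness: was the record updated recently enough to be usable?
        "timeliness": sum((as_of - r["updated"]).days <= max_age_days for r in rows) / total,
    }

# Hypothetical sample rows
rows = [
    {"customer_id": 1, "email": "a@example.com", "status": "active", "age": 34, "updated": date(2019, 6, 1)},
    {"customer_id": 2, "email": "", "status": "ACTIVE", "age": 51, "updated": date(2019, 6, 10)},
    {"customer_id": 2, "email": "c@example.com", "status": "closed", "age": 230, "updated": date(2019, 1, 15)},
    {"customer_id": 4, "email": "d@example.com", "status": "active", "age": 28, "updated": date(2019, 6, 12)},
]
scores = measure_quality(rows, as_of=date(2019, 6, 15))
for name, score in scores.items():
    print(f"{name}: {score:.2f}")
```

Integrity and consistency are left out because they need a second source or a history of values to compare against.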
20. New data, new data quality challenges
New Data Quality problems
• 3rd Party and external data with unknown provenance, timeliness, or relevance
• Bias in the data, whether in collection, extraction, or other processing
• Data without standardized structure or formatting
• Continuously streaming data
• Disjointed data (e.g., gaps in comprehensiveness or receipt)
• Consistency and verification of data sources (e.g., was the origination verified?)
• Changes and transformation applied to data (i.e., does it really represent the original input?)
“34 percent of bankers in our survey report that their organization has been the target of adversarial AI at least once, and 78 percent believe automated systems create new risks, such as fake data, external data manipulation, and inherent bias.”
Accenture Banking Technology Vision 2018
22. Assess Fitness for Purpose
Work within the defined Business Frame!
• Reconfirm the business purpose and context
• Review the data attributes deemed critical and the criteria that require validation
Test and validate data for identified DQ measurements
• Apply data profiling and established business rules
• Establish baselines!
• Evaluate and determine necessary actions/remediate issues
• Take action on incorrect data and defaults
• Create flags for subsequent use in marking or remediating data
Annotate what you’ve found
• Identify each attribute/criterion and annotate all issues
• Utilize flags, tags, and other indicators to help others distinguish the type and severity of issues
Establish, document, and present Fitness for Purpose
Iterate for all data in use, as well as model validation
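The test, flag, and baseline steps above can be sketched in a few lines. Everything named here is hypothetical: the three rules, the flag names, the baseline pass rates, and the 5-point tolerance are placeholders to show the shape of the process, not recommended values.

```python
# Each rule maps a flag name to a check that returns True when the record passes.
RULES = {
    "missing_email": lambda r: bool(r.get("email")),
    "default_birth_date": lambda r: r.get("birth_date") != "1900-01-01",
    "negative_balance": lambda r: r.get("balance", 0) >= 0,
}

def flag_records(records):
    """Copy each record and attach a `dq_flags` list naming every failed rule."""
    flagged = []
    for rec in records:
        failed = [name for name, passes in RULES.items() if not passes(rec)]
        flagged.append({**rec, "dq_flags": failed})
    return flagged

def pass_rates(flagged):
    """Fraction of records that pass each rule."""
    total = len(flagged)
    return {name: sum(name not in r["dq_flags"] for r in flagged) / total
            for name in RULES}

def compare_to_baseline(current, baseline, tolerance=0.05):
    """List rules whose pass rate fell more than `tolerance` below baseline."""
    return [name for name, rate in current.items()
            if baseline[name] - rate > tolerance]

# Hypothetical baseline from an earlier profiling run, and a new batch of records
baseline = {"missing_email": 0.95, "default_birth_date": 0.90, "negative_balance": 1.0}
records = [
    {"email": "a@example.com", "birth_date": "1980-04-02", "balance": 10.0},
    {"email": "", "birth_date": "1900-01-01", "balance": 5.0},
    {"email": "c@example.com", "birth_date": "1975-11-30", "balance": 0.0},
    {"email": "d@example.com", "birth_date": "1991-07-19", "balance": 42.5},
]
flagged = flag_records(records)
rates = pass_rates(flagged)
regressions = compare_to_baseline(rates, baseline)
print(rates)
print("Below baseline:", regressions)
```

Keeping the flags on the records, rather than silently correcting or dropping them, lets model builders decide how to treat each issue and makes the fitness-for-purpose assessment easier to communicate.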
23. Communicate!
Culture of Data Literacy
“Democratization of Data” requires cultural support
• Empowered to ask questions about the data
• Trained to understand the business context and use of data
• Trained to understand how to approach and evaluate data quality
• Traditional data, new data, machine learning requirements, …
• Empowered to prove/reject hypotheses
Program of Data Governance
• Provide the processes and practices necessary for success
• Measure, monitor, and improve
• Continuous iteration and development
• Communicate what you’ve discovered! (and where others can find it!)
Center of Excellence/Knowledge Base
• Where do you go to find answers?
• Who can help show you how?
24. Summary
Keep AI/ML projects focused
It is challenging to keep the business frame/value in mind!
• Data comes from multiple disparate systems & sources
• The business context may not be obvious based on data alone
• There is a higher demand and expectation for seeing data quality in context.
• You need to assess and measure the data content to establish both baselines and common understanding
4 Key Steps
1. Remember the end goal – ask questions, use best practices, and establish scope & context
2. Consider what data is needed
• Focus your attention based on the type of data and the use case
• Consider how you can ensure data is comprehensive, relevant, and useful
3. Test rules to validate data quality, establish baselines, communicate findings, and build trust!
4. Assess and communicate fitness for purpose
Gaining insight and measurement of data quality is more critical than ever!
25. Further Resources
• Data Profiling: The First Step to Big Data Quality
• Emerging Data Quality Trends for Governing and Analyzing Big Data
• Introducing Trillium DQ for Big Data: Powerful Profiling and Data Quality for the Data Lake
harald.smith@syncsort.com