Machine learning techniques for data management – Martin Fadler, Tobias Pentek | 0
Tobias Pentek
Martin Fadler
Competence Center
Corporate Data Quality
(CC CDQ)
Machine learning techniques to improve data
management and data quality
Taxonomy, archetypes and case studies
Marcus Evans, Amsterdam, 08.02.2019
Agenda
1 Data management for the digital enterprise
2 AI/ML’s potential in data management
3 Research process and outcomes
4 Machine learning technique archetypes in detail
5 Conclusion and outlook 2019
The Competence Center Corporate Data Quality (CC CDQ)
is a research consortium
• Foundation: 2006
• Members: 35+
• CC CDQ workshops: 60+
• PhD graduates: 12
• Contacts within the CDQ community: 1,500+
Practitioners and researchers jointly develop innovative data
management solutions in the CC CDQ
Current members of the Competence Center Corporate Data Quality
Consortium research is conducted in association between research institutions and companies
There are multiple publications from the CC CDQ
for your reference
Current research topics cover data catalogs, data lake
governance, data strategies and AI in data management
CC CDQ Research Focus in 2019
• Data lake governance
• Data management capabilities for regulatory compliance
• Data quality and business impact
• Data catalogs
• AI in data management
• Data management rollout to new domains
• Data platforms & ecosystems
• Data strategies
AI has a long history, but major progress has been made
recently thanks to abundant data and cheap processing power
Data + computing power → AI/ML
Image sources: https://www.flickr.com/photos/103454225@N06/9965173654 https://commons.wikimedia.org/wiki/File:Wikimedia_Foundation_Servers-8055_10.jpg
Illustration: artificial intelligence pictured as Methusalix from Asterix & Obelix
Potential for automation with computers is increasing
https://twitter.com/andrewyng/status/788548053745569792?lang=en
Machines become creative: first picture generated with a
machine learning technique sold at Christie’s for $432,500
A machine was fed digital pictures by various artists of the past
centuries. Based on this knowledge, it learned different drawing styles and
painted its own picture.
https://www.theverge.com/2018/10/25/18023266/ai-art-portrait-christies-obvious-sold https://www.christies.com/features/A-collaboration-between-two-artists-one-human-one-a-machine-9332-1.aspx
An old principle still holds true: “Garbage in, garbage out”
Data quality is among top management’s
major concerns when thinking about
implementing AI.¹
Bad training data → AI → bad results
¹ Pyle and José (2015)
Machine learning holds great potential to improve data
quality and support data management
https://www.bloomberg.com/professional/blog/machine-learning-plays-critical-role-improving-data-quality/
• Hidden factories: bad data quality costs an enterprise between 15% and 25% of its annual revenue.¹
• Data scientists spend almost 80% of their time collecting, cleaning and organizing data.²
• Tamr, a machine learning curation system, lowered curation costs by about 90% in three real-world enterprise curation problems.³
¹ Redman (2017)
² Crowdflower (2016)
³ Stonebraker et al. (2013)
CC CDQ research has identified and classified ML
techniques for data management
Questions:
• How can ML techniques support data management and improve data quality (DQ)?
• What are typical applications of ML techniques in data management?
Sources: 44 cases from scientific research, expert knowledge and tools
Outcome 1: Taxonomy of ML techniques for data management, with the dimensions roles, data problems, data quality impact, data input type, data output type and machine learning approach
Outcome 2: 11 archetypes of ML techniques for data management
In a first step, we developed a taxonomy for classifying
different ML techniques for data management
ML techniques for data management
Data management context
• Roles: Data collector, Data custodian, Data consumer
• Data problems: Data cleaning, Entity resolution, Data transformation, Data integration, Data provenance, Metadata/schema discovery, Data and metadata profiling, Data archiving, Data enrichment, Data generation, Data monitoring
• Data quality impact: Intrinsic, Contextual, Representational, Accessibility
ML technique
• Data input type: Structured, Semi-structured, Unstructured
• Data output type: Structured, Semi-structured, Unstructured
• Machine learning approach: Supervised, Active learning, Unsupervised
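The taxonomy can be sketched as a small data structure against which case classifications are validated. This is only an illustration: the names `TAXONOMY` and `classify_case`, the dimension keys and the sample classification of the tariff-code case are assumptions made for this example, not part of the CC CDQ material.

```python
# Hypothetical encoding of the six taxonomy dimensions and their values.
TAXONOMY = {
    "role": {"data collector", "data custodian", "data consumer"},
    "data_problem": {
        "data cleaning", "entity resolution", "data transformation",
        "data integration", "data provenance", "metadata/schema discovery",
        "data and metadata profiling", "data archiving", "data enrichment",
        "data generation", "data monitoring",
    },
    "dq_impact": {"intrinsic", "contextual", "representational", "accessibility"},
    "input_type": {"structured", "semi-structured", "unstructured"},
    "output_type": {"structured", "semi-structured", "unstructured"},
    "ml_approach": {"supervised", "active learning", "unsupervised"},
}


def classify_case(title, **dimensions):
    """Check a case classification against the taxonomy and return it as a dict."""
    record = {"title": title}
    for dimension, values in dimensions.items():
        if dimension not in TAXONOMY:
            raise ValueError(f"unknown dimension: {dimension}")
        unknown = set(values) - TAXONOMY[dimension]
        if unknown:
            raise ValueError(f"unknown {dimension} value(s): {sorted(unknown)}")
        record[dimension] = sorted(values)
    return record


# Assumed classification, chosen for illustration only.
case = classify_case(
    "Predicting the tariff code of a material master",
    role=["data collector"],
    data_problem=["data enrichment"],
    dq_impact=["contextual"],
    input_type=["structured"],
    output_type=["structured"],
    ml_approach=["supervised"],
)
print(case)
```

A case may carry several values per dimension, which is why each dimension takes a list.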
In a second step, we analyzed and classified 44 cases
applying ML techniques based on the taxonomy
ID | Source | Category | Short title
1.1 | Expert 1 | Expert | Data entry: Default values by data entry
7 | Parikh et al. (2010) | Research | Predict data entry
26 | Grammarly | Tool | Correction of syntactic and semantic errors in text
2 | Expert 2 | Expert | Detect material data from pictures
5 | Sarawagi et al. (2001) | Research | Extract data object from text
6 | Knoblock et al. (2003) | Research | Learn to extract data from web
9 | Hu et al. (2017) | Research | Extract data object from text
18 | Wu et al. (2016) | Research | Google’s Neural Machine Translation System
29 | bonobo.ai | Tool | Retrieve insights from customer support calls, texts and other interactions
34 | DeepL | Tool | DeepL translation service
8 | Hui Han et al. (2003) | Research | Extract metadata from documents
1.2 | Expert 1 | Expert | Data enrichment: Forecast processing time
1.3 | Expert 1 | Expert | Data enrichment: Assign product code from description
3 | Expert 3 | Expert | Predicting the tariff code of a material master
28 | commercetools | Tool | Automatically categorize products with machine learning
32.2 | Reltio | Tool | (2) Machine learning-assisted data enrichment
25 | AX semantics | Tool | Generation of product descriptions
27 | Flixstock | Tool | Generation of catalog pictures using semantic segmentation
10 | Volkovs et al. (2014) | Research | Continuous Data Cleaning
15 | Pit-Claudel et al. (2016) | Research | Outlier Detection in Heterogeneous Datasets
16 | Yakout et al. (2011) | Research | Guided Data Repair
14 | Bhamidipaty and Sarawagi (2002) | Research | Interactive Deduplication using Active Learning
17.2 | Stonebraker et al. (2013) | Research | Data curation system (2): entity resolution
31 | Talend | Tool | Using machine learning for data matching
32.1 | Reltio | Tool | (1) Machine learning-assisted data matching
12 | Liu et al. (2000) | Research | Database Integration using Neural Networks
13 | Halevy et al. (2002) | Research | Learn mappings between ontologies
17.1 | Stonebraker et al. (2013) | Research | Data curation system (1): schema integration
21 | Berlin and Motro (2002) | Research | Schema matching using machine learning
22 | Doan et al. (2001) | Research | Schema matching using machine learning
23 | Eckert et al. (2009) | Research | Improving ontology matching using meta-level learning
24 | Shi et al. (2009) | Research | Actively learning ontology matching via user interaction
36 | Octopai | Tool | Discover metadata and trace data across systems (data lineage)
33.1 | Amazon Macie | Tool | Automatically discover, classify, and protect sensitive data in AWS
35.2 | pingar Discoveryone | Tool | (2) Detect sensitive data across systems
4.1 | Expert 1 | Expert | Business rules mining (1): extract and cluster rules
11 | Hipp et al. (2001) | Research | Data quality mining: Association rules learning
19 | Leser et al. (2009) | Research | A Machine Learning Approach to Foreign Key Discovery
35.1 | pingar Discoveryone | Tool | (1) Detect data to retire
4.2 | Expert 1 | Expert | Business rules mining (2): predict interestingness from previous ratings
33.2 | Amazon Macie | Tool | Detect unauthorized access and avoid data leak
20 | Fernandez et al. (2018) | Research | Linking Datasets using Word Embeddings for Data Discovery
30 | Alation | Tool | Recommendation of tables joins
35.3 | pingar Discoveryone | Tool | (3) Discover Metadata (named entities, keyphrases, pre-defined categories)
(The per-case classification marks in columns 1.1 to 6.3 of the slide are not reproduced here.)
Table columns: ID; 1 - Roles; 2 - Data problems; 3 - DQ impact; 4 - Data input type; 5 - Data output type; 6 - ML approach (columns 4 to 6: ML technique)
44 cases applying ML techniques in data management:
• 7 from experts
• 16 from tool vendors
• 21 from research
11 typical applications of ML techniques in data management (“archetypes”) have been identified
The results will be published in a whitepaper in Q1/2019
Publication in Q1/2019
ML supports all data management phases
• Acquire & create: Data collector, Data producer
• Unify & maintain: Data custodian
• Protect & retire: Data protection officer
• Discover & use: Data consumer
Machine learning spans all four phases.
Acquire and create:
ML-assisted data creation and enrichment
Roles: Data collector, Data producer
Data pain points:
• Manual effort for data creation
• Wrong or invalid data entries (typos, blank fields, errors)
LEARNING
• Data entry patterns
• Data incidents
• Data extraction patterns
• Data creation patterns
AUTOMATION
• Auto-filling values in forms
• Automatic extraction of data from unstructured data
• Assignment of attributes/values
BENEFITS
• »First-time-right« principle
• Increased efficiency in data entry and creation
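Auto-filling from learned data entry patterns can be illustrated with a deliberately simple frequency model: for each value of a context field, suggest the value most often entered in the target field. The function name `learn_defaults`, the field names and the sample records are assumptions for this sketch; the tools surveyed in the deck may use quite different models.

```python
from collections import Counter, defaultdict


def learn_defaults(records, context_field, target_field):
    """Learn the most frequent target value for each context value."""
    counts = defaultdict(Counter)
    for record in records:
        value = record.get(target_field)
        if value:  # blank entries carry no pattern, so skip them
            counts[record[context_field]][value] += 1
    return {
        context: counter.most_common(1)[0][0]
        for context, counter in counts.items()
    }


# Hypothetical entry history for a material master form.
history = [
    {"material_group": "screws", "unit": "PCE"},
    {"material_group": "screws", "unit": "PCE"},
    {"material_group": "screws", "unit": "KG"},
    {"material_group": "cables", "unit": "M"},
    {"material_group": "cables", "unit": ""},
]

defaults = learn_defaults(history, "material_group", "unit")
# Suggest a default when a user starts a new "screws" record.
print(defaults.get("screws"))
```

A suggestion is only a default the user can overwrite, which matches the idea of supporting rather than replacing manual data entry.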
Case study Robert Bosch GmbH:
Assignment of correct customs tariff numbers supported by ML
Problem:
• Customs tariff numbers (“commodity codes”) are
a binding legal requirement.
• Consequences of a wrong code can be significant,
e.g. delayed customs clearance or severe penalties.
Initial situation:
• Assignment of commodity codes is implemented at Bosch as a globally uniform process
• A central team of highly qualified experts (Center of Expertise; CoE) assigns the correct codes
on GTS within hours, but is limited in its capacity
• 200,000 classified material masters with 200 different commodity codes for 26 countries and
the European Union
• 70% of Bosch’s business units are covered by this global classification process, and an increase
in classification requests is expected
CDQ Good Practice Award 2018 submission:
https://www.cc-cdq.ch/sites/default/files/cdq_award/CDQ%20Good%20Practice%20Award%202018_Robert%20Bosch.pdf
Supervised machine learning learns the relationship
between features and a label.
Training was based on two sets of 11,000 and 50,000 unique
instances, each with about 200 different labels.
The ML solution enables automated assignment of
commodity codes with high accuracy (90%).
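Learning the relationship between features and a label can be sketched with a small text classifier. The slides do not name the algorithm used at Bosch, so the multinomial naive Bayes below is a stand-in chosen for brevity; the class name, the material descriptions and the placeholder labels `CODE_A` and `CODE_B` are all assumptions.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesTextClassifier:
    """Multinomial naive Bayes over whitespace tokens with add-one smoothing."""

    def fit(self, texts, labels):
        self.label_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocabulary = {
            word for counter in self.word_counts.values() for word in counter
        }
        return self

    def predict(self, text):
        words = text.lower().split()
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.label_counts.items():
            # log prior plus smoothed log likelihood of each word
            score = math.log(count / total)
            denominator = sum(self.word_counts[label].values()) + len(self.vocabulary)
            for word in words:
                score += math.log((self.word_counts[label][word] + 1) / denominator)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical training data: material descriptions with placeholder codes.
texts = [
    "steel hex bolt m8",
    "steel screw m6 zinc",
    "rubber sealing ring",
    "rubber gasket seal",
]
labels = ["CODE_A", "CODE_A", "CODE_B", "CODE_B"]

model = NaiveBayesTextClassifier().fit(texts, labels)
print(model.predict("zinc bolt m8"))
print(model.predict("rubber seal"))
```

With about 200 labels, as in the case study, a real setup would also need a held-out test set to measure accuracy and a confidence threshold below which requests still go to the experts.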
Unify and maintain:
ML-assisted data maintenance and unification
Role: Data custodian
Data pain points:
• Correction of data errors (reactive)
• Lacking expert know-how for definition of business rules (proactive)
• Data integration from multiple systems (redundancies, inconsistencies)
LEARNING
• Data repairing patterns
• Association rules
• Outlier detection
• Semantic mapping
AUTOMATION
• Reactive: data correction
• Proactive: business rules
• Data unification
BENEFITS
• Increased efficiency of (reactive/proactive) data maintenance and unification processes
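The proactive use of association rules can be sketched as follows: mine simple "if field A has value a, then field B has value b" rules from existing records, then flag records that contradict a rule. The function names, thresholds and sample records are assumptions for this example, and a flagged record is a candidate for review rather than a confirmed error.

```python
from collections import Counter, defaultdict


def mine_rules(records, antecedent, consequent, min_support=3, min_confidence=0.75):
    """Return {antecedent value: expected consequent value} for frequent, confident rules."""
    counts = defaultdict(Counter)
    for record in records:
        counts[record[antecedent]][record[consequent]] += 1
    rules = {}
    for value, counter in counts.items():
        total = sum(counter.values())
        top_value, top_count = counter.most_common(1)[0]
        if total >= min_support and top_count / total >= min_confidence:
            rules[value] = top_value
    return rules


def find_violations(records, antecedent, consequent, rules):
    """Return records whose consequent value contradicts a mined rule."""
    return [
        record
        for record in records
        if record[antecedent] in rules
        and record[consequent] != rules[record[antecedent]]
    ]


# Hypothetical business partner records.
records = [
    {"id": 1, "country": "DE", "currency": "EUR"},
    {"id": 2, "country": "DE", "currency": "EUR"},
    {"id": 3, "country": "DE", "currency": "EUR"},
    {"id": 4, "country": "DE", "currency": "EUR"},
    {"id": 5, "country": "DE", "currency": "USD"},
    {"id": 6, "country": "CH", "currency": "CHF"},
    {"id": 7, "country": "CH", "currency": "CHF"},
]

rules = mine_rules(records, "country", "currency")
violations = find_violations(records, "country", "currency", rules)
print(rules)
print([record["id"] for record in violations])
```

The support threshold keeps rare combinations (here the two CH records) from becoming rules, which limits false alarms on small groups.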
Case study Tamr:
Data Curation at Scale: The Data Tamer System
The Data Tamer system (later the company Tamr) reduced data curation costs
by about 90% in three real-world examples:
Web aggregator:
• Federation of 80,000 URLs, with 13 million records and 200k local attributes in total,
into a semantically cohesive collection of facts
Biology application:
• Integrating lab reports of 8,000 biologists and chemists
• Each spreadsheet has ca. 1 million rows, with a total of 100k attribute names
Health service:
• Deduplication and aggregation of an integrated database of claim records from 300
insurance carriers
Stonebraker et al. (2013)
Tamr “Technical whitepaper”
Protect and retire:
ML-assisted data protection and retirement
Roles: Data protection officer; Data custodian (data steward, data manager, …)
Data pain points (data protection officer):
• No transparency about where personally identifiable information (PII) is stored
• Compliance with data protection regulations
Data pain points (data custodian):
• Retirement of data
LEARNING
• PII identifiers
• Fraudulent data access behavior
AUTOMATION
• Identification of sensitive data
• Identification of “end-of-life” data
BENEFITS
• Reduced risk
• Increased regulatory compliance
Case Study Corporate Data League:
Identification of natural persons with ML
• European data privacy regulation (GDPR) forces
companies to handle personal information carefully.
• In most supplier and customer databases, some
records represent natural persons (e.g. doctors,
freelancers, or contacts), and address data, for example, may
be considered personal information.
• Our tools are trained with large volumes of global forenames,
surnames, legal forms etc., and are able to identify
natural person records.
Remark: Our “GDPR Screening” services can also be listed as a “technical and
organizational measure” (TOM) for GDPR audits.
Verify & enrich
Example
Name | Country | Category
Castillo Karem | PA | Legal Entity
Ali Ahmad Ali BASAHI | YE | Natural Person
Derald Grue | US | Natural Person
EXPERTISES GALTIER | FR | Legal Entity
Jesús García Pastor | ES | Natural Person
Landwirtschaftsbetrieb Thomas Frahm | DE | Legal Entity
UFK po Respublike Karelia | RU | Legal Entity
Sonesson Ronni | SE | Natural Person
Eduardo Alves da Cunha | BR | Natural Person
Shanghai Ju Yang Forging Machine | CN | Legal Entity
Ecopack Bulgaria AD | BG | Legal Entity
Neborachek S.I. | UA | Legal Entity
Alper Ozel | TR | Natural Person
Hassan Ali Shah | PK | Natural Person
Amit Kumar Singh | IN | Natural Person
Rosenborgs Bakeri AS | NO | Legal Entity
ROBERTO RAMOS ANTON | ES | Natural Person
FRANCESCO GIUSEPPE GENOVESE | IT | Natural Person
BORA KECIC AD-specijalni transporti | RS | Legal Entity
“CDQ AG” → 0
“CDQ GmbH” → 0
“Martin Fadler” → 1 (Natural person)
“Tobias Pentek” → 1 (Natural person)
“Microsoft Deutschland” → 0
“Marcus Evans” → 0
Business partner data → convolutional neural network (learns the mapping between input and output) → identified natural persons
Discover and use:
ML-assisted data discovery
Role: Data consumer, e.g.:
• Data scientist
• Data citizen
• Developer
Data pain points:
• Finding and cleaning relevant data
• Identifying relationships between data sets
LEARNING
• Data usage
• Semantic mapping
AUTOMATION
• Recommendations
• Linking of datasets
BENEFITS
• Increased access to, and use of, data
• Easier interpretability of data
Alation:
Recommend relevant data, e.g. tables for SQL joins
https://alation.com/product/
• AI/ML learns from previous SQL queries in the data catalog which tables are most likely to be joined.
• The user writes an SQL query in the data catalog.
• AI/ML proposes tables to join.
• The user joins the proposed table and in this way finds data much faster.
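The idea of learning join recommendations from previous SQL queries can be sketched with simple co-occurrence counts: tables that often appear together in past queries are proposed first. This is not Alation's algorithm, which the slides do not describe; the function names, the naive regular expression and the query log are assumptions for this example.

```python
import re
from collections import Counter
from itertools import combinations

# Naive pattern: the identifier that follows FROM or JOIN (no subqueries, no quoting).
TABLE_PATTERN = re.compile(r"\b(?:from|join)\s+([A-Za-z_][\w.]*)", re.IGNORECASE)


def tables_in(query):
    """Return the sorted, lower-cased table names referenced in a query."""
    return sorted({name.lower() for name in TABLE_PATTERN.findall(query)})


def learn_join_counts(query_log):
    """Count how often each pair of tables appears in the same query."""
    pair_counts = Counter()
    for query in query_log:
        for left, right in combinations(tables_in(query), 2):
            pair_counts[(left, right)] += 1
    return pair_counts


def recommend_joins(pair_counts, table, top_n=3):
    """Propose the tables most frequently used together with the given table."""
    table = table.lower()
    scores = Counter()
    for (left, right), count in pair_counts.items():
        if left == table:
            scores[right] += count
        elif right == table:
            scores[left] += count
    return [name for name, _ in scores.most_common(top_n)]


# Hypothetical log of previous queries from a data catalog.
query_log = [
    "SELECT * FROM orders JOIN customers ON orders.customer_id = customers.id",
    "SELECT * FROM orders o JOIN customers c ON o.customer_id = c.id",
    "SELECT * FROM orders JOIN products ON orders.product_id = products.id",
    "SELECT name FROM customers",
]

pair_counts = learn_join_counts(query_log)
print(recommend_joins(pair_counts, "orders"))
```

A production system would parse SQL properly and could weight recent or trusted queries more heavily; the counting step is the part that corresponds to "learning which tables are most likely to be joined".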
Until now, a total of 11 archetypes of ML techniques for data
management have been identified
Acquire and create
• Support the manual data entry: Default values and data repair by data entry (Data cleaning). Case IDs: 1.1, 7, 26. DQ impact: intrinsic and contextual.
• Extract data from unstructured data: Data object extraction from unstructured data, text data translation and metadata extraction to create data automatically (Data transformation and metadata discovery). Case IDs: 2, 5, 6, 9, 18, 29, 34. DQ impact: intrinsic and contextual.
• Create data automatically: Data enrichment and generation to automate further processing and manual creation activities (Data enrichment and generation). Case IDs: 1.2, 1.3, 3, 28, 32.2, 25, 27. DQ impact: contextual and representational.
Unify and maintain
• Guide data cleaning process: Correct errors and inconsistencies with active learning (Data cleaning, entity resolution). Case IDs: 10, 14, 15, 16, 17.2, 32.1. DQ impact: intrinsic and contextual.
• Learn rules from data: Extract and update the set of quality and business rules to avoid data errors (Data and metadata profiling). Case IDs: 4.1, 11. DQ impact: representational.
• Interpret semantics for data integration: Semantic integration of data tables and ontologies, also in an interactive manner with active learning (Data integration). Case IDs: 12, 13, 17.1, 21, 22, 23, 24. DQ impact: representational and accessibility.
Protect and retire
• Detect data across systems: Find and trace sensitive data and data that needs to be retired across systems (Metadata or schema discovery, data provenance). Case IDs: 36, 33.1, 35.1, 35.2. DQ impact: representational and accessibility.
• Detect suspicious data usage: Detect unauthorized access and data leaks from data access patterns (Data monitoring). Case IDs: 33.2. DQ impact: representational and accessibility.
Discover and use
• Provide context data (Data enrichment). Case IDs: 4.2, 35.3. DQ impact: representational.
• Interpret semantics to link data sets: Semantic integration of data (Data integration and data provenance). Case IDs: 20. DQ impact: contextual, representational and accessibility.
• Recommend relevant data (Data integration). Case IDs: 30. DQ impact: contextual and accessibility.
The evaluation shows clearly the potential, but also the
early stage of AI in data management
Results of survey (work in progress)
8 participants
CC CDQ workshop 26.09.2018
Chart: utility versus implementation per usage scenario, with candidates with high potential highlighted
Likert scale values:
• Implementation: 1 (Not started), 2 (Evaluating), 3 (Prototyping), 4 (Project), 5 (Operational)
• Utility: 1 (Very low), 2 (Low), 3 (Moderate), 4 (High), 5 (Very high)
Most of the identified usage scenarios seem to have clear relevance and
high to very high utility. Still, all of the scenarios are at the beginning of
investigation. Only one scenario has made it to a prototype.
Likert scale values:
• Implementation: 1 (Not started), 2 (Evaluating), 3 (Prototyping), 4 (Project), 5 (Operational)
• Utility: 1 (Very low), 2 (Low), 3 (Moderate), 4 (High), 5 (Very high)
Category | No. | Usage scenario | Implementation | Utility | Difference
Acquire and create data | 2.5.1a | Auto-fill and typo correction by data entry | 2.63 | 4.14 | 1.52
Acquire and create data | 2.5.1b | Data object detection in unstructured data | 1.88 | 3.50 | 1.63
Acquire and create data | 2.5.1c | Text data translation and generation | 1.00 | 3.33 | 2.33
Acquire and create data | 2.5.1d | Chatbots for data entry | 1.50 | 2.86 | 1.36
Unify and maintain data | 2.5.2a | Detection and correction of errors and inconsistencies | 1.43 | 4.57 | 3.14
Unify and maintain data | 2.5.2b | Rules validation and extension | 2.14 | 4.00 | 1.86
Unify and maintain data | 2.5.2c | Data matching | 2.00 | 4.14 | 2.14
Unify and maintain data | 2.5.2d | Data enrichment | 1.75 | 4.14 | 2.39
Protect and retire data | 2.5.3a | Sensitive information detection | 1.50 | 3.71 | 2.21
Protect and retire data | 2.5.3b | Identification of data that needs to be retired | 1.50 | 3.57 | 2.07
Discover and use data | 2.5.4a | Keyword-based and semantic search | 1.50 | 3.43 | 1.93
Discover and use data | 2.5.4b | Chatbot to find data | 1.00 | 3.00 | 2.00
Discover and use data | 2.5.4c | Semantic integration of data from heterogeneous sources | 1.13 | 3.14 | 2.02
Discover and use data | 2.5.4d | Recommendation of data | 1.75 | 2.71 | 0.96
Work in progress
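The "Difference" column is utility minus implementation, so scenarios with a large gap are rated useful but barely started. The short sketch below recomputes that gap from the rounded means in the table and ranks the scenarios; variable names are assumptions, and because the inputs are already rounded, a recomputed gap can deviate by 0.01 from the printed Difference value (for example 1.51 instead of 1.52 for scenario 2.5.1a).

```python
# (No., usage scenario, mean implementation score, mean utility score) from the survey table.
scenarios = [
    ("2.5.1a", "Auto-fill and typo correction by data entry", 2.63, 4.14),
    ("2.5.1b", "Data object detection in unstructured data", 1.88, 3.50),
    ("2.5.1c", "Text data translation and generation", 1.00, 3.33),
    ("2.5.1d", "Chatbots for data entry", 1.50, 2.86),
    ("2.5.2a", "Detection and correction of errors and inconsistencies", 1.43, 4.57),
    ("2.5.2b", "Rules validation and extension", 2.14, 4.00),
    ("2.5.2c", "Data matching", 2.00, 4.14),
    ("2.5.2d", "Data enrichment", 1.75, 4.14),
    ("2.5.3a", "Sensitive information detection", 1.50, 3.71),
    ("2.5.3b", "Identification of data that needs to be retired", 1.50, 3.57),
    ("2.5.4a", "Keyword-based and semantic search", 1.50, 3.43),
    ("2.5.4b", "Chatbot to find data", 1.00, 3.00),
    ("2.5.4c", "Semantic integration of data from heterogeneous sources", 1.13, 3.14),
    ("2.5.4d", "Recommendation of data", 1.75, 2.71),
]

# Rank by the gap between perceived utility and implementation status.
ranked = sorted(scenarios, key=lambda s: s[3] - s[2], reverse=True)

for number, name, implementation, utility in ranked[:3]:
    print(f"{number} {name}: gap {utility - implementation:.2f}")
```

With only 8 survey participants, the ranking is an indication of where prototypes might pay off first rather than a statistically robust result.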
Expert Survey “Machine learning techniques for data
management implementation status and future potential”
If you would like to participate in the expert survey and
receive the results, please click here.
Learnings and implications for data management
Findings
1. Machine learning has significant potential to improve data quality but will at
the same time disrupt how data is managed
2. Highly repetitive and simple cases will be automated by machines, but humans
need to intervene in more difficult and complex cases
3. Redesign of work processes required:
- Machine takes over prediction
- Human judges output and confirms
4. New roles and skills required
ML in data management:
Outlook on CC CDQ research activities in 2019
Research topics
• Impact of AI/ML on shared service centers and data management processes
• Prototypical implementation of scenarios with high potential
Research activities
• Ongoing screening of ML techniques
• Update on taxonomy and archetypes
• Survey on utility and future potential of ML techniques
Idea
• Provide benchmark datasets for research
Please reach out to us if you have any further questions or
topics to discuss
Martin Fadler (Ph.D. cand.)
Research associate
Competence Center Corporate Data Quality
martin.fadler@cdq.ch
+41 78 405 16 80
Tobias Pentek
Head of Community and Innovation
Competence Center Corporate Data Quality
tobias.pentek@cdq.ch
Links
• CDQ AG: www.cdq.ch
• CC CDQ Portal: www.cc-cdq.ch
• CDQ Academy: www.cdq.iwi.unisg.ch
• CC CDQ Community at XING: www.xing.com/net/cdqm
• CDQ at Twitter: https://twitter.com/cdq_ag
• CC CDQ Community at LinkedIn: https://www.linkedin.com/groups/8137247
Q&A
Thank you! Questions?

The-Morgan-Kaufmann-Series-in-Data-Management-Systems-Jiawei-Han-Micheline-Ka...The-Morgan-Kaufmann-Series-in-Data-Management-Systems-Jiawei-Han-Micheline-Ka...
The-Morgan-Kaufmann-Series-in-Data-Management-Systems-Jiawei-Han-Micheline-Ka...DrGnaneswariG
 
Rule-based Information Extraction for Airplane Crashes Reports
Rule-based Information Extraction for Airplane Crashes ReportsRule-based Information Extraction for Airplane Crashes Reports
Rule-based Information Extraction for Airplane Crashes ReportsCSCJournals
 
Rule-based Information Extraction for Airplane Crashes Reports
Rule-based Information Extraction for Airplane Crashes ReportsRule-based Information Extraction for Airplane Crashes Reports
Rule-based Information Extraction for Airplane Crashes ReportsCSCJournals
 
Association rule visualization technique
Association rule visualization techniqueAssociation rule visualization technique
Association rule visualization techniquemustafasmart
 
Profiling Linked Open Data
Profiling Linked Open DataProfiling Linked Open Data
Profiling Linked Open DataBlerina Spahiu
 
A SURVEY OF BIG DATA ANALYTICS
A SURVEY OF BIG DATA ANALYTICSA SURVEY OF BIG DATA ANALYTICS
A SURVEY OF BIG DATA ANALYTICSijistjournal
 
Slide 26 sept2017v2
Slide 26 sept2017v2Slide 26 sept2017v2
Slide 26 sept2017v2Faizura Haneem
 

Ähnlich wie Machine learning techniques to improve data management and data quality (20)

Machine learning for data management - Competence Center Corporate Data Quali...
Machine learning for data management - Competence Center Corporate Data Quali...Machine learning for data management - Competence Center Corporate Data Quali...
Machine learning for data management - Competence Center Corporate Data Quali...
 
Machine Learning for Data Management - Scenarios and Outlook
Machine Learning for Data Management - Scenarios and OutlookMachine Learning for Data Management - Scenarios and Outlook
Machine Learning for Data Management - Scenarios and Outlook
 
313 IDS _Course_Introduction_PPT.pptx
313 IDS _Course_Introduction_PPT.pptx313 IDS _Course_Introduction_PPT.pptx
313 IDS _Course_Introduction_PPT.pptx
 
Introduction to Data Science.pdf
Introduction to Data Science.pdfIntroduction to Data Science.pdf
Introduction to Data Science.pdf
 
Hattrick-Simpers MRS Webinar on AI in Materials
Hattrick-Simpers MRS Webinar on AI in MaterialsHattrick-Simpers MRS Webinar on AI in Materials
Hattrick-Simpers MRS Webinar on AI in Materials
 
IRJET- Fault Detection and Prediction of Failure using Vibration Analysis
IRJET-	 Fault Detection and Prediction of Failure using Vibration AnalysisIRJET-	 Fault Detection and Prediction of Failure using Vibration Analysis
IRJET- Fault Detection and Prediction of Failure using Vibration Analysis
 
Extraction and Retrieval of Web based Content in Web Engineering
Extraction and Retrieval of Web based Content in Web EngineeringExtraction and Retrieval of Web based Content in Web Engineering
Extraction and Retrieval of Web based Content in Web Engineering
 
Ontology Tutorial: Semantic Technology for Intelligence, Defense and Security
Ontology Tutorial: Semantic Technology for Intelligence, Defense and SecurityOntology Tutorial: Semantic Technology for Intelligence, Defense and Security
Ontology Tutorial: Semantic Technology for Intelligence, Defense and Security
 
IBM Data Analyst Professional Certificate - C01 - W01.pptx
IBM Data Analyst Professional Certificate - C01 - W01.pptxIBM Data Analyst Professional Certificate - C01 - W01.pptx
IBM Data Analyst Professional Certificate - C01 - W01.pptx
 
Big Data Analytics
Big Data AnalyticsBig Data Analytics
Big Data Analytics
 
The FAIR data movement and 22 Feb 2023.pdf
The FAIR data movement and 22 Feb 2023.pdfThe FAIR data movement and 22 Feb 2023.pdf
The FAIR data movement and 22 Feb 2023.pdf
 
Mining Social Media Data for Understanding Drugs Usage
Mining Social Media Data for Understanding Drugs  UsageMining Social Media Data for Understanding Drugs  Usage
Mining Social Media Data for Understanding Drugs Usage
 
Data_Warehousing_and_Data_Minng_Text_Book.pdf
Data_Warehousing_and_Data_Minng_Text_Book.pdfData_Warehousing_and_Data_Minng_Text_Book.pdf
Data_Warehousing_and_Data_Minng_Text_Book.pdf
 
The-Morgan-Kaufmann-Series-in-Data-Management-Systems-Jiawei-Han-Micheline-Ka...
The-Morgan-Kaufmann-Series-in-Data-Management-Systems-Jiawei-Han-Micheline-Ka...The-Morgan-Kaufmann-Series-in-Data-Management-Systems-Jiawei-Han-Micheline-Ka...
The-Morgan-Kaufmann-Series-in-Data-Management-Systems-Jiawei-Han-Micheline-Ka...
 
Rule-based Information Extraction for Airplane Crashes Reports
Rule-based Information Extraction for Airplane Crashes ReportsRule-based Information Extraction for Airplane Crashes Reports
Rule-based Information Extraction for Airplane Crashes Reports
 
Rule-based Information Extraction for Airplane Crashes Reports
Rule-based Information Extraction for Airplane Crashes ReportsRule-based Information Extraction for Airplane Crashes Reports
Rule-based Information Extraction for Airplane Crashes Reports
 
Association rule visualization technique
Association rule visualization techniqueAssociation rule visualization technique
Association rule visualization technique
 
Profiling Linked Open Data
Profiling Linked Open DataProfiling Linked Open Data
Profiling Linked Open Data
 
A SURVEY OF BIG DATA ANALYTICS
A SURVEY OF BIG DATA ANALYTICSA SURVEY OF BIG DATA ANALYTICS
A SURVEY OF BIG DATA ANALYTICS
 
Slide 26 sept2017v2
Slide 26 sept2017v2Slide 26 sept2017v2
Slide 26 sept2017v2
 

KĂźrzlich hochgeladen

BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxolyaivanovalion
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxolyaivanovalion
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionfulawalesam
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz1
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusTimothy Spann
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...SUHANI PANDEY
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxJohnnyPlasten
 
Zuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxZuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxolyaivanovalion
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxolyaivanovalion
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxolyaivanovalion
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Valters Lauzums
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Delhi Call girls
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfRachmat Ramadhan H
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Researchmichael115558
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptxAnupama Kate
 
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...shivangimorya083
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxolyaivanovalion
 

KĂźrzlich hochgeladen (20)

BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptx
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and Milvus
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptx
 
Zuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxZuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptx
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFx
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx
 
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptx
 

Machine learning techniques to improve data management and data quality

  • 1. Machine learning techniques for data management – Martin Fadler, Tobias Pentek | 0 Tobias Pentek Martin Fadler Competence Center Corporate Data Quality (CC CDQ) Machine learning techniques to improve data management and data quality Taxonomy, archetypes and case studies Marcus Evans, Amsterdam, 08.02.2019
  • 2. Agenda: (1) Data management for the digital enterprise; (2) AI/ML’s potential in data management; (3) Research process and outcomes; (4) Machine learning technique archetypes in detail; (5) Conclusion and outlook 2019
  • 4. The Competence Center Corporate Data Quality (CC CDQ) is a research consortium: founded in 2006; 35+ members; 60+ CC CDQ workshops; 12 PhD graduates; 1,500+ contacts within the CDQ community
  • 5. Practitioners and researchers jointly develop innovative data management solutions in the CC CDQ. Consortium research is conducted in association between research institutions and companies. Figure: current members of the Competence Center Corporate Data Quality.
  • 6. There are multiple publications from the CC CDQ for your reference
  • 7. Current research topics cover data catalogs, data lake governance, data strategies and AI in data management. CC CDQ research focus in 2019: data lake governance; data management capabilities for regulatory compliance; data quality and business impact; data catalogs; AI in data management; data management rollout to new domains; data platforms & ecosystems; data strategies
  • 9. AI has a long history, but major progress has been made recently thanks to abundant data and cheap processing power (data + computing power → AI/ML). Illustration: artificial intelligence pictured as Methusalix from Asterix & Obelix. Image sources: https://www.flickr.com/photos/103454225@N06/9965173654 https://commons.wikimedia.org/wiki/File:Wikimedia_Foundation_Servers-8055_10.jpg
  • 10. Potential for automation with computers is increasing. Source: https://twitter.com/andrewyng/status/788548053745569792?lang=en
  • 11. Machines become creative: the first picture generated with a machine learning technique sold at Christie's for $432,500. A machine was fed digital pictures by various artists of past centuries. Based on this knowledge, it learned different drawing styles and painted its own picture. Sources: https://www.theverge.com/2018/10/25/18023266/ai-art-portrait-christies-obvious-sold https://www.christies.com/features/A-collaboration-between-two-artists-one-human-one-a-machine-9332-1.aspx
  • 12. An old principle still holds true: “Garbage in, garbage out”. Bad training data → AI → bad results. Data quality is among top management’s major concerns when thinking about implementing AI (Pyle and José, 2015).
  • 13. Machine learning holds great potential to improve data quality and support data management. Source: https://www.bloomberg.com/professional/blog/machine-learning-plays-critical-role-improving-data-quality/
      - Hidden factories: bad data quality costs an enterprise between 15% and 25% of its annual revenue (Redman, 2017).
      - Data scientists spend almost 80% of their time collecting, cleaning and organizing data (Crowdflower, 2016).
      - Tamr, a machine learning curation system, showed in three real-world enterprise curation problems that it can lower curation cost by about 90% (Stonebraker et al., 2013).
  • 15. CC CDQ research has identified and classified ML techniques for data management
      - Questions: How can ML techniques support data management and improve data quality (DQ)? What are typical applications of ML techniques in data management?
      - Inputs: scientific research, expert knowledge and tools (44 cases)
      - Outcome 1: taxonomy of ML techniques for data management (dimensions: roles, data problems, data quality impact, data input type, data output type, machine learning approach)
      - Outcome 2: 11 archetypes of ML techniques for data management
  • 16. In a first step, we developed a taxonomy for classifying different ML techniques for data management. The taxonomy covers the data management context and the ML technique, with six dimensions:
      - Roles: data collector, data custodian, data consumer
      - Data problems: data cleaning, entity resolution, data transformation, data integration, data provenance, metadata/schema discovery, data and metadata profiling, data archiving, data enrichment, data generation, data monitoring
      - Data quality impact: intrinsic, contextual, representational, accessibility
      - Data input type: structured, semi-structured, unstructured
      - Data output type: structured, semi-structured, unstructured
      - Machine learning approach: supervised, active learning, unsupervised
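One way to read the taxonomy is as a small lookup structure against which a case can be checked. The following is a minimal sketch only: the dimension and characteristic names come from the slide, while the key names, the `validate_case` helper and the example case are hypothetical and are not the authors' classification.

```python
# Taxonomy dimensions and characteristics as listed on the slide.
# The dictionary keys are hypothetical identifiers chosen for this sketch.
TAXONOMY = {
    "roles": {"data collector", "data custodian", "data consumer"},
    "data_problems": {
        "data cleaning", "entity resolution", "data transformation",
        "data integration", "data provenance", "metadata/schema discovery",
        "data and metadata profiling", "data archiving", "data enrichment",
        "data generation", "data monitoring",
    },
    "dq_impact": {"intrinsic", "contextual", "representational", "accessibility"},
    "data_input_type": {"structured", "semi-structured", "unstructured"},
    "data_output_type": {"structured", "semi-structured", "unstructured"},
    "ml_approach": {"supervised", "active learning", "unsupervised"},
}


def validate_case(case):
    """Return a list of problems; an empty list means the case fits the taxonomy."""
    problems = []
    for dimension, allowed in TAXONOMY.items():
        values = set(case.get(dimension, ()))
        if not values:
            problems.append(f"{dimension}: no characteristic assigned")
        unknown = values - allowed
        if unknown:
            problems.append(f"{dimension}: unknown {sorted(unknown)}")
    return problems


# Hypothetical classification of a deduplication case, for illustration only.
example_case = {
    "roles": {"data custodian"},
    "data_problems": {"entity resolution"},
    "dq_impact": {"intrinsic"},
    "data_input_type": {"structured"},
    "data_output_type": {"structured"},
    "ml_approach": {"active learning"},
}

print(validate_case(example_case))
```

Sets are used per dimension because a single case may carry more than one characteristic in the same dimension.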
  • 17. In a second step, we analyzed and classified 44 cases applying ML techniques based on the taxonomy. Each case is classified against the taxonomy characteristics 1.1 to 6.3 (1 Roles, 2 Data problems, 3 DQ impact, 4 Data input type, 5 Data output type, 6 ML approach). 11 typical applications of ML techniques in data management (“archetypes”) have been identified. The 44 cases listed comprise 7 from experts, 16 from tool vendors and 21 from research.
      ID | Source | Category | Short title
      1.1 | Expert 1 | Expert | Data entry: default values by data entry
      7 | Parikh et al (2010) | Research | Predict data entry
      26 | Grammarly | Tool | Correction of syntactic and semantic errors in text
      2 | Expert 2 | Expert | Detect material data from pictures
      5 | Sarawagi et al (2001) | Research | Extract data object from text
      6 | Knoblock et al (2003) | Research | Learn to extract data from web
      9 | Hu et al (2017) | Research | Extract data object from text
      18 | Wu et al (2016) | Research | Google’s Neural Machine Translation System
      29 | bonobo.ai | Tool | Retrieve insights from customer support calls, texts and other interactions
      34 | DeepL | Tool | DeepL translation service
      8 | Hui Han et al (2003) | Research | Extract metadata from documents
      1.2 | Expert 1 | Expert | Data enrichment: forecast processing time
      1.3 | Expert 1 | Expert | Data enrichment: assign product code from description
      3 | Expert 3 | Expert | Predicting the tariff code of a material master
      28 | commercetools | Tool | Automatically categorize products with machine learning
      32.2 | Reltio | Tool | (2) Machine learning-assisted data enrichment
      25 | AX semantics | Tool | Generation of product descriptions
      27 | Flixstock | Tool | Generation of catalog pictures using semantic segmentation
      10 | Volkovs et al (2014) | Research | Continuous Data Cleaning
      15 | Pit-Claudel et al (2016) | Research | Outlier Detection in Heterogeneous Datasets
      16 | Yakout et al (2011) | Research | Guided Data Repair
      14 | Bhamidipaty and Sarawagi (2002) | Research | Interactive Deduplication using Active Learning
      17.2 | Stonebraker et al (2013) | Research | Data curation system (2): entity resolution
      31 | Talend | Tool | Using machine learning for data matching
      32.1 | Reltio | Tool | (1) Machine learning-assisted data matching
      12 | Liu et al (2000) | Research | Database Integration using Neural Networks
      13 | Halevy et al (2002) | Research | Learn mappings between ontologies
      17.1 | Stonebraker et al (2013) | Research | Data curation system (1): schema integration
      21 | Berlin and Motro (2002) | Research | Schema matching using machine learning
      22 | Doan et al (2001) | Research | Schema matching using machine learning
      23 | Eckert et al (2009) | Research | Improving ontology matching using meta-level learning
      24 | Shi et al (2009) | Research | Actively learning ontology matching via user interaction
      36 | Octopai | Tool | Discover metadata and trace data across systems (data lineage)
      33.1 | Amazon Macie | Tool | Automatically discover, classify, and protect sensitive data in AWS
      35.2 | pingar Discoveryone | Tool | (2) Detect sensitive data across systems
      4.1 | Expert 1 | Expert | Business rules mining (1): extract and cluster rules
      11 | Hipp et al (2001) | Research | Data quality mining: association rules learning
      19 | Leser et al (2009) | Research | A Machine Learning Approach to Foreign Key Discovery
      35.1 | pingar Discoveryone | Tool | (1) Detect data to retire
      4.2 | Expert 1 | Expert | Business rules mining (2): predict interestingness from previous ratings
      33.2 | Amazon Macie | Tool | Detect unauthorized access and avoid data leak
      20 | Fernandez et al (2018) | Research | Linking Datasets using Word Embeddings for Data Discovery
      30 | Alation | Tool | Recommendation of table joins
      35.3 | pingar Discoveryone | Tool | (3) Discover metadata (named entities, keyphrases, pre-defined categories)
  • 18. The results will be published in a whitepaper in Q1/2019.
  • 20. Machine learning supports all data management phases. Phases: acquire & create; unify & maintain; protect & retire; discover & use. Roles: data collector, data producer, data custodian, data protection officer, data consumer.
  • 22. Acquire and clean: ML-assisted data creation and enrichment
      - Roles: data collector, data producer
      - Data pain points: manual effort for data creation; wrong or invalid data entries (typos, blank fields, errors)
      - Learning: data entry patterns; data incidents; data extraction patterns; data creation patterns
      - Automation: auto-filling values in forms; automatic extraction of data from unstructured data; assignment of attributes/values
      - Benefits: “first-time-right” principle; increased efficiency in data entry and creation
  • 23. Case study Robert Bosch GmbH: Assignment of correct customs tariffs supported with ML
  • 24. Case study Robert Bosch GmbH: Assignment of correct customs tariffs supported with ML
      - Problem: customs tariff numbers (“commodity codes”) are a binding legal requirement. The consequences of a wrong code can be significant, e.g. delayed customs clearance or severe penalties.
      - Initial situation: the assignment of commodity codes is implemented at Bosch as a globally uniform process. A central team of highly qualified experts (Center of Expertise; CoE) assigns the correct codes on GTS within hours, but is limited in its capacity. There are 200,000 classified material masters with 200 different commodity codes for 26 countries and the European Union. 70% of Bosch’s business units are covered by this global classification process, and an increase in classification requests is expected.
      - Source: CDQ Good Practice Award 2018 submission: https://www.cc-cdq.ch/sites/default/files/cdq_award/CDQ%20Good%20Practice%20Award%202018_Robert%20Bosch.pdf
  • 25. Case study Robert Bosch GmbH: Assignment of correct customs tariffs supported with ML
      - Approach: supervised machine learning learns the relationship between features and a label.
      - Training was based on two sets of 11,000 and 50,000 unique instances, each with about 200 different labels.
      - Result: the ML solution enables an automated assignment of commodity codes with high accuracy (90%).
      - Source: CDQ Good Practice Award 2018 submission: https://www.cc-cdq.ch/sites/default/files/cdq_award/CDQ%20Good%20Practice%20Award%202018_Robert%20Bosch.pdf
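The idea of learning a mapping from features to a label can be illustrated with a small text classifier. This is a sketch under stated assumptions, not the Bosch solution: the slides do not name the model or the features, so a multinomial naive Bayes classifier over description words is swapped in here, and the material descriptions and the labels `CODE_A` and `CODE_B` are invented placeholders rather than real commodity codes.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesTextClassifier:
    """Multinomial naive Bayes with add-one smoothing over word tokens."""

    def fit(self, texts, labels):
        self.label_counts = Counter(labels)
        self.token_counts = defaultdict(Counter)
        self.vocabulary = set()
        for text, label in zip(texts, labels):
            tokens = text.lower().split()
            self.token_counts[label].update(tokens)
            self.vocabulary.update(tokens)
        return self

    def predict(self, text):
        tokens = text.lower().split()
        total_docs = sum(self.label_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, doc_count in self.label_counts.items():
            counts = self.token_counts[label]
            total_tokens = sum(counts.values())
            # Log prior of the label plus smoothed log likelihood of each token.
            score = math.log(doc_count / total_docs)
            for token in tokens:
                score += math.log((counts[token] + 1) / (total_tokens + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented training data: material descriptions (features) and codes (labels).
texts = [
    "steel hex bolt m8",
    "steel screw m6 zinc",
    "rubber sealing ring",
    "rubber gasket ring 20mm",
]
labels = ["CODE_A", "CODE_A", "CODE_B", "CODE_B"]

classifier = NaiveBayesTextClassifier().fit(texts, labels)
print(classifier.predict("zinc bolt m6"))
```

With about 200 labels and tens of thousands of instances, as in the case study, a production setup would also need a held-out test set to measure accuracy; that step is omitted here.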
  • 27. Unify and maintain: ML-assisted data maintenance and unification
      - Role: data custodian
      - Data pain points: correction of data errors (reactive); lacking expert know-how for the definition of business rules (proactive); data integration from multiple systems (redundancies, inconsistencies)
      - Learning: data repairing patterns; association rules; outlier detection; semantic mapping
      - Automation: reactive data correction; proactive business rules; data unification
      - Benefits: increased efficiency of (reactive/proactive) data maintenance and unification processes
  • 28. Case study Tamr: Data Curation at Scale: The Data Tamer System. The Data Tamer system (later the company Tamr) reduced data curation costs by 90% in three real-world examples (Stonebraker et al., 2013):
      - Web aggregator: federation of 80,000 URLs with 13 million records and 200k local attributes in total into a semantically cohesive collection of facts
      - Biology application: integration of the lab reports of 8,000 biologists and chemists; each spreadsheet has about 1 million rows, with a total of 100k attribute names
      - Health service: deduplication and aggregation of an integrated database of claim records from 300 insurance carriers
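Deduplication of the kind mentioned in the health service example can be sketched as pairwise similarity scoring followed by clustering. This is a simplified illustration and not the Data Tamer algorithm: a fixed Jaccard threshold stands in for a learned matching model, and the record names and the threshold of 0.5 are assumptions made for this example.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of the lower-cased word sets of two strings."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def deduplicate(records, threshold=0.5):
    """Group records whose pairwise similarity reaches the threshold."""
    parent = list(range(len(records)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(records)), 2):
        if jaccard(records[i], records[j]) >= threshold:
            parent[find(j)] = find(i)

    clusters = {}
    for index, record in enumerate(records):
        clusters.setdefault(find(index), []).append(record)
    return list(clusters.values())


# Invented records, for illustration only.
records = [
    "Acme Insurance Ltd",
    "ACME Insurance Limited",
    "Northwind Health",
    "Northwind Health Inc",
    "Contoso Claims",
]
for cluster in deduplicate(records):
    print(cluster)
```

Comparing every pair is quadratic in the number of records, so at the scale quoted on the slide some form of candidate pre-selection would be needed; the sketch leaves that out.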
  • 29. Case study tamr: Data Curation at Scale: The Data Tamer System. Source: Tamr “Technical whitepaper”.
  • 30. ML supports all data management phases. Phases: Acquire & create; Unify & maintain; Protect & retire; Discover & use. Roles: data collector, data producer, data custodian, data protection officer, data consumer.
  • 31. Protect and retire: ML-assisted data protection and retirement. Data protection officer, data pain points: no transparency about where personally identifiable information (PII) is stored; compliance with data protection regulations. Data custodian (data steward, data manager, …), data pain points: retirement of data. Automation: identification of sensitive data; identification of “end-of-life”. Learning: PII identifiers; fraudulent data access behavior. Benefits: reduced risk; increased regulatory compliance.
  • 32. Case Study Corporate Data League: Identification of natural persons with ML. The European data privacy regulation (GDPR) forces companies to handle personal information carefully. In most supplier and customer databases, some records represent natural persons (e.g. doctors, freelancers, or contacts). Address data, for example, may be considered personal information. Our tools are trained with large numbers of global forenames, surnames, legal forms etc., and are able to identify natural person records. Remark: Our “GDPR Screening” services can also be listed as a “technical and organizational measure” (TOM) for GDPR audits. Example:
      Name | Country | Category
      Castillo Karem | PA | Legal Entity
      Ali Ahmad Ali BASAHI | YE | Natural Person
      Derald Grue | US | Natural Person
      EXPERTISES GALTIER | FR | Legal Entity
      Jesús García Pastor | ES | Natural Person
      Landwirtschaftsbetrieb Thomas Frahm | DE | Legal Entity
      UFK po Respublike Karelia | RU | Legal Entity
      Sonesson Ronni | SE | Natural Person
      Eduardo Alves da Cunha | BR | Natural Person
      Shanghai Ju Yang Forging Machine | CN | Legal Entity
      Ecopack Bulgaria AD | BG | Legal Entity
      Neborachek S.I. | UA | Legal Entity
      Alper Ozel | TR | Natural Person
      Hassan Ali Shah | PK | Natural Person
      Amit Kumar Singh | IN | Natural Person
      Rosenborgs Bakeri AS | NO | Legal Entity
      ROBERTO RAMOS ANTON | ES | Natural Person
      FRANCESCO GIUSEPPE GENOVESE | IT | Natural Person
      BORA KECIC AD-specijalni transporti | RS | Legal Entity
  • 33. Case Study Corporate Data League: Identification of natural persons with ML. Examples: “CDQ AG” → 0; “CDQ GmbH” → 0; “Martin Fadler” → 1 (Natural person); “Tobias Pentek” → 1 (Natural person); “Microsoft Deutschland” → 0; “Marcus Evans” → 0.
  • 34. Case Study Corporate Data League: Identification of natural persons with ML. A convolutional neural network learns the mapping between input (business partner data) and output (identified natural persons).
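To see why a learned mapping is attractive for this task, it can help to try a hand-written rule on the six labelled names from the case study (1 = natural person, 0 = not). The sketch below is a rule-based baseline, not the convolutional neural network described in the slides; the legal-form token list and the "two or three alphabetic tokens" rule are assumptions made purely for illustration.

```python
# Illustrative, non-exhaustive list of legal-form tokens (an assumption).
LEGAL_FORM_TOKENS = {"ag", "gmbh", "as", "ad", "ltd", "inc"}

def is_natural_person(name):
    """Rule-based baseline: 1 if the name looks like a person, else 0."""
    tokens = name.lower().replace(".", " ").split()
    if any(tok in LEGAL_FORM_TOKENS for tok in tokens):
        return 0  # an explicit legal form signals a legal entity
    looks_like_person = 2 <= len(tokens) <= 3 and all(t.isalpha() for t in tokens)
    return 1 if looks_like_person else 0

# Labelled examples taken from the case study slides.
EXAMPLES = [
    ("CDQ AG", 0),
    ("CDQ GmbH", 0),
    ("Martin Fadler", 1),
    ("Tobias Pentek", 1),
    ("Microsoft Deutschland", 0),
    ("Marcus Evans", 0),
]

errors = [name for name, label in EXAMPLES if is_natural_person(name) != label]
accuracy = 1 - len(errors) / len(EXAMPLES)
print(errors)
print(round(accuracy, 2))
```

The baseline handles names with an explicit legal form but mislabels company names that look like person names, which is the kind of case a model trained on many forenames, surnames and legal forms is meant to resolve.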
  • 35. ML supports all data management phases. Phases: Acquire & create; Unify & maintain; Protect & retire; Discover & use. Roles: data collector, data producer, data custodian, data protection officer, data consumer.
  • 36. Discover and use: ML-assisted data discovery. Role: data consumer, e.g. data scientist, data citizen, developer. Data pain points: finding and cleaning relevant data; identifying relationships between data sets. Automation: recommendations; linking of datasets. Learning: data usage; semantic mapping. Benefits: increased access to, and use of, data; easier interpretability of data.
  • 37. Alation: Recommend relevant data, e.g. tables for SQL joins ( https://alation.com/product/ ). From previous SQL queries in the data catalog, the machine learns which tables are most likely to be joined. When a user writes a SQL query in the data catalog, AI/ML proposes tables to join. The user joins the proposed table and in this way finds data much faster.
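The idea of learning join candidates from earlier queries can be sketched with simple co-occurrence counts. This is a hedged illustration, not Alation's actual method: it assumes a plain list of past SQL strings, extracts table names with a naive regular expression, and treats any two tables referenced in the same query as joined. The query log and table names are hypothetical.

```python
import re
from collections import Counter
from itertools import combinations

# Naive pattern: the identifier that follows FROM or JOIN is taken as a table name.
TABLE_REF = re.compile(r"\b(?:from|join)\s+([a-z_][a-z0-9_.]*)", re.IGNORECASE)

def tables_in(query):
    """Return the set of table names referenced in one query."""
    return {name.lower() for name in TABLE_REF.findall(query)}

def learn_join_counts(query_log):
    """Count how often each pair of tables appears in the same query."""
    pair_counts = Counter()
    for query in query_log:
        for a, b in combinations(sorted(tables_in(query)), 2):
            pair_counts[(a, b)] += 1
    return pair_counts

def recommend(pair_counts, table, top_n=3):
    """Propose the tables most frequently used together with `table`."""
    scores = Counter()
    for (a, b), n in pair_counts.items():
        if a == table:
            scores[b] += n
        elif b == table:
            scores[a] += n
    return [name for name, _ in scores.most_common(top_n)]

# Hypothetical query log.
QUERY_LOG = [
    "SELECT * FROM orders JOIN customers ON orders.customer_id = customers.id",
    "SELECT o.id FROM orders o JOIN customers c ON o.customer_id = c.id "
    "JOIN products p ON o.product_id = p.id",
    "SELECT * FROM orders JOIN products ON orders.product_id = products.id",
    "SELECT * FROM orders JOIN customers ON orders.customer_id = customers.id",
]

print(recommend(learn_join_counts(QUERY_LOG), "orders"))
```

A production catalog would parse SQL properly and could weight recent or popular queries more heavily; raw frequency is used here only because it is the simplest signal that matches the description on the slide.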
  • 38. Agenda: 1. Data management for the digital enterprise; 2. AI/ML’s potential in data management; 3. Research process and outcomes; 4. Machine learning technique archetypes in detail; 5. Conclusion and outlook 2019.
  • 39. Until now, a total of 11 archetypes of ML techniques for data management have been identified:
      Process | Archetype | Case IDs | DQ impact
      Acquire and create | Support the manual data entry: Default values and data repair by data entry (Data cleaning) | 1.1, 7, 26 | Intrinsic and contextual
      Acquire and create | Extract data from unstructured data: Data object extraction from unstructured data, text data translation and metadata extraction to create data automatically (Data transformation and metadata discovery) | 2, 5, 6, 9, 18, 29, 34 | Intrinsic and contextual
      Acquire and create | Create data automatically: Data enrichment and generation to automate further processing and manual creation activities (Data enrichment and generation) | 1.2, 1.3, 3, 28, 32.2, 25, 27 | Contextual and representational
      Unify and maintain | Guide data cleaning process: Correct errors and inconsistencies with active learning (Data cleaning, entity resolution) | 10, 14, 15, 16, 17.2, 32.1 | Intrinsic and contextual
      Unify and maintain | Learn rules from data: Extract and update the set of quality and business rules to avoid data errors (Data and metadata profiling) | 4.1, 11 | Representational
      Unify and maintain | Interpret semantics for data integration: Semantic integration of data tables and ontologies, also in an interactive manner with active learning (Data integration) | 12, 13, 17.1, 21, 22, 23, 24 | Representational and accessibility
      Protect and retire | Detect data across systems: Find and trace sensitive data and data that needs to be retired across systems (Metadata or schema discovery, data provenance) | 36, 33.1, 35.1, 35.2 | Representational and accessibility
      Protect and retire | Detect suspicious data usage: Detect unauthorized access and data leaks from data access patterns (Data monitoring) | 33.2 | Representational and accessibility
      Discover and use | Provide context data (Data enrichment) | 4.2, 35.3 | Representational
      Discover and use | Interpret semantics to link data sets: Semantic integration of data (Data integration and data provenance) | 20 | Contextual, representational and accessibility
      Discover and use | Recommend relevant data (Data integration) | 30 | Contextual and accessibility
  • 40. The evaluation clearly shows the potential, but also the early stage, of AI in data management (work in progress). Results of a survey of 8 participants at the CC CDQ workshop on 26.09.2018. Chart axes: utility and implementation; highlighted region: candidates with high potential. Likert scale values: (1) Implementation: 1 (Not started), 2 (Evaluating), 3 (Prototyping), 4 (Project), 5 (Operational); (2) Utility: 1 (Very low), 2 (Low), 3 (Moderate), 4 (High), 5 (Very high). Most of the identified usage scenarios seem to have clear relevance and a utility of high to very high. Still, all of the scenarios are at an early stage of investigation; only one scenario has made it to a prototype.
  • 41. The evaluation clearly shows the potential, but also the early stage, of AI in data management (work in progress). Likert scale values: (1) Implementation: 1 (Not started), 2 (Evaluating), 3 (Prototyping), 4 (Project), 5 (Operational); (2) Utility: 1 (Very low), 2 (Low), 3 (Moderate), 4 (High), 5 (Very high).
      Category | No. | Usage scenario | Implementation | Utility | Difference
      Acquire and create data | 2.5.1a | Auto-fill and typo correction by data entry | 2.63 | 4.14 | 1.52
      Acquire and create data | 2.5.1b | Data object detection in unstructured data | 1.88 | 3.50 | 1.63
      Acquire and create data | 2.5.1c | Text data translation and generation | 1.00 | 3.33 | 2.33
      Acquire and create data | 2.5.1d | Chatbots for data entry | 1.50 | 2.86 | 1.36
      Unify and maintain data | 2.5.2a | Detection and correction of errors and inconsistencies | 1.43 | 4.57 | 3.14
      Unify and maintain data | 2.5.2b | Rules validation and extension | 2.14 | 4.00 | 1.86
      Unify and maintain data | 2.5.2c | Data matching | 2.00 | 4.14 | 2.14
      Unify and maintain data | 2.5.2d | Data enrichment | 1.75 | 4.14 | 2.39
      Protect and retire data | 2.5.3a | Sensitive information detection | 1.50 | 3.71 | 2.21
      Protect and retire data | 2.5.3b | Identification of data that needs to be retired | 1.50 | 3.57 | 2.07
      Discover and use data | 2.5.4a | Keyword-based and semantic search | 1.50 | 3.43 | 1.93
      Discover and use data | 2.5.4b | Chatbot to find data | 1.00 | 3.00 | 2.00
      Discover and use data | 2.5.4c | Semantic integration of data from heterogeneous sources | 1.13 | 3.14 | 2.02
      Discover and use data | 2.5.4d | Recommendation of data | 1.75 | 2.71 | 0.96
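The Difference column appears to be the mean utility score minus the mean implementation score. The short sketch below recomputes that gap from the rounded means on the slide and ranks the scenarios by it; because the inputs are already rounded, a few recomputed values differ from the slide by 0.01. The variable and function names are illustrative.

```python
# (No., mean implementation, mean utility) as reported on the slide.
SCENARIOS = [
    ("2.5.1a", 2.63, 4.14),
    ("2.5.1b", 1.88, 3.50),
    ("2.5.1c", 1.00, 3.33),
    ("2.5.1d", 1.50, 2.86),
    ("2.5.2a", 1.43, 4.57),
    ("2.5.2b", 2.14, 4.00),
    ("2.5.2c", 2.00, 4.14),
    ("2.5.2d", 1.75, 4.14),
    ("2.5.3a", 1.50, 3.71),
    ("2.5.3b", 1.50, 3.57),
    ("2.5.4a", 1.50, 3.43),
    ("2.5.4b", 1.00, 3.00),
    ("2.5.4c", 1.13, 3.14),
    ("2.5.4d", 1.75, 2.71),
]

def rank_by_gap(scenarios):
    """Return (No., gap) pairs sorted by utility minus implementation, largest first."""
    gaps = [(no, round(utility - impl, 2)) for no, impl, utility in scenarios]
    return sorted(gaps, key=lambda item: item[1], reverse=True)

ranked = rank_by_gap(SCENARIOS)
for no, gap in ranked[:3]:
    print(no, gap)
```

Read this way, the largest gap belongs to scenario 2.5.2a (detection and correction of errors and inconsistencies), where perceived utility is high but implementation has barely started.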
  • 42. Expert Survey “Machine learning techniques for data management implementation status and future potential” If you would like to participate in the expert survey and receive the results, please click here.
  • 43. Learning and implications for data management. Findings: 1. Machine learning has significant potential to improve data quality, but will at the same time disrupt how data is managed. 2. Highly repetitive and simple cases will be automated by machines, but humans need to intervene in more difficult and complex cases. 3. Work processes need to be redesigned: the machine takes over prediction; the human judges the output and confirms. 4. New roles and skills are required.
  • 44. ML in data management: outlook on CC CDQ research activities in 2019. Research topics: impact of AI/ML on shared service center and data management processes; prototypical implementation of scenarios with high potential. Research activities: ongoing screening of ML techniques; update on taxonomy and archetypes; survey on utility and future potential of ML techniques. Idea: provide benchmark datasets for research.
  • 45. Please reach out to us if you have any further questions or topics to discuss. Contacts at the Competence Center Corporate Data Quality: Martin Fadler (Ph.D. cand.), Research associate, martin.fadler@cdq.ch, +41 78 405 16 80; Tobias Pentek, Head of Community and Innovation, tobias.pentek@cdq.ch. Links: CDQ AG: www.cdq.ch; CC CDQ Portal: www.cc-cdq.ch; CDQ Academy: www.cdq.iwi.unisg.ch; CC CDQ Community at XING: www.xing.com/net/cdqm; CDQ at Twitter: https://twitter.com/cdq_ag; CC CDQ Community at LinkedIn: https://www.linkedin.com/groups/8137247
  • 46. Q&A Thank you! Questions?