Machine learning techniques to improve data management and data quality: presentation by Tobias Pentek and Martin Fadler from the Competence Center Corporate Data Quality, given at the Marcus Evans event in Amsterdam on 08.02.2019.
Machine learning techniques to improve data management and data quality
1. Machine learning techniques for data management - Martin Fadler, Tobias Pentek | 0
Tobias Pentek
Martin Fadler
Competence Center
Corporate Data Quality
(CC CDQ)
Machine learning techniques to improve data
management and data quality
Taxonomy, archetypes and case studies
Marcus Evans, Amsterdam, 08.02.2019
Agenda
1. Data management for the digital enterprise
2. AI/ML's potential in data management
3. Research process and outcomes
4. Machine learning technique archetypes in detail
5. Conclusion and outlook 2019
The Competence Center Corporate Data Quality (CC CDQ)
is a research consortium
• Foundation: 2006
• Members: 35+
• CC CDQ workshops: 60+
• PhD graduates: 12
• Contacts within the CDQ community: 1,500+
Practitioners and researchers jointly develop innovative data
management solutions in the CC CDQ
Current members of the Competence Center Corporate Data Quality
Consortium research is conducted in association between research institutions and companies
There are multiple publications from the CC CDQ
for your reference
Current research topics cover data catalogs, data lake
governance, data strategies and AI in data management
CC CDQ Research Focus in 2019
• Data lake governance
• Data management capabilities for regulatory compliance
• Data quality and business impact
• Data catalogs
• AI in data management
• Data management rollout to new domains
• Data platforms & ecosystems
• Data strategies
AI has a long history, but major progress has been made
recently thanks to abundant data and cheap processing power
[Figure: data and computing power as drivers of AI/ML]
https://www.flickr.com/photos/103454225@N06/9965173654 https://commons.wikimedia.org/wiki/File:Wikimedia_Foundation_Servers-8055_10.jpg
Artificial
Intelligence
Methusalix from Asterix & Obelix
Potential for automation with computers is increasing
https://twitter.com/andrewyng/status/788548053745569792?lang=en
Machines become creative: the first picture generated with a
machine learning technique sold at Christie's for $432,500
A machine was fed with digital pictures by various artists of the past
centuries. Based on this knowledge, it learned different drawing styles and
painted its own picture.
https://www.theverge.com/2018/10/25/18023266/ai-art-portrait-christies-obvious-sold https://www.christies.com/features/A-collaboration-between-two-artists-one-human-one-a-machine-9332-1.aspx
An old principle still holds true: "Garbage in, garbage out"
Data quality is among top management's
major concerns when thinking about
implementing AI.1
Bad training data → AI → bad results
1 Pyle and José (2015)
Machine learning holds great potential to improve data
quality and support data management
https://www.bloomberg.com/professional/blog/machine-learning-plays-critical-role-improving-data-quality/
• Hidden factories: bad data quality costs an enterprise between 15% and 25% of its annual revenue.1
• Data scientists spend almost 80% of their time collecting, cleaning and organizing data.2
• Tamr, a machine learning curation system, showed in three real-world enterprise curation problems that it can lower curation cost by about 90%.3
1 Redman (2017)
2 Crowdflower (2016)
3 Stonebraker et al (2013)
CC CDQ research has identified and classified ML
techniques for data management
Questions:
• How can ML techniques support data management and improve data quality (DQ)?
• What are typical applications of ML techniques in data management?
Sources: scientific research, expert knowledge, tools (44 cases)
Outcome 1: Taxonomy of ML techniques for data management (context and application)
• Roles: data collector, data custodian, data consumer
• Data problems: data cleaning, entity resolution, data transformation, data integration, data provenance, metadata/schema discovery, data and metadata profiling, data archiving, data enrichment, data generation, data monitoring
• Data quality impact: intrinsic, contextual, representational, accessibility
• Data input type: structured, semi-structured, unstructured
• Data output type: structured, semi-structured, unstructured
• Machine learning approach: supervised, active learning, unsupervised
Outcome 2: 11 archetypes of ML techniques for data management
In a first step, we developed a taxonomy for classifying
different ML techniques for data management
ML techniques for data management
Data management context
• Roles: data collector, data custodian, data consumer
• Data problems: data cleaning, entity resolution, data transformation, data integration, data provenance, metadata/schema discovery, data and metadata profiling, data archiving, data enrichment, data generation, data monitoring
• Data quality impact: intrinsic, contextual, representational, accessibility
ML technique
• Data input type: structured, semi-structured, unstructured
• Data output type: structured, semi-structured, unstructured
• Machine learning approach: supervised, active learning, unsupervised
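One way to make such a taxonomy usable for classifying cases is to encode it as a plain data structure. The following is a minimal sketch, not taken from the presentation: the dictionary keys, the `validate_case` helper and the example classification are all hypothetical, and only the dimension values come from the slide.

```python
# Hypothetical encoding of the six taxonomy dimensions listed on the slide.
TAXONOMY = {
    "roles": {"data collector", "data custodian", "data consumer"},
    "data_problems": {
        "data cleaning", "entity resolution", "data transformation",
        "data integration", "data provenance", "metadata/schema discovery",
        "data and metadata profiling", "data archiving", "data enrichment",
        "data generation", "data monitoring",
    },
    "dq_impact": {"intrinsic", "contextual", "representational", "accessibility"},
    "input_type": {"structured", "semi-structured", "unstructured"},
    "output_type": {"structured", "semi-structured", "unstructured"},
    "ml_approach": {"supervised", "active learning", "unsupervised"},
}


def validate_case(case):
    """Return the (dimension, value) pairs of a case that are not in the taxonomy."""
    errors = []
    for dimension, values in case.items():
        allowed = TAXONOMY.get(dimension)
        if allowed is None:
            errors.append((dimension, None))  # unknown dimension
            continue
        for value in values:
            if value not in allowed:
                errors.append((dimension, value))
    return errors


# Illustrative classification only; the slides do not state these values.
example_case = {
    "roles": ["data collector"],
    "data_problems": ["data enrichment"],
    "dq_impact": ["contextual"],
    "input_type": ["structured"],
    "output_type": ["structured"],
    "ml_approach": ["supervised"],
}
print(validate_case(example_case))  # an empty list means the case fits the taxonomy
```

A case may carry several values per dimension, which is why each dimension maps to a list.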
In a second step, we analyzed and classified 44 cases
applying ML techniques based on the taxonomy
Each case is classified along the taxonomy dimensions: 1 - Roles (1.1-1.3), 2 - Data problems (2.1-2.11), 3 - DQ impact (3.1-3.4), 4 - Data input type (4.1-4.3), 5 - Data output type (5.1-5.3), 6 - ML approach (6.1-6.3).
ID | Source | Category | Short title
1.1 | Expert 1 | Expert | Data entry: Default values by data entry
7 | Parikh et al (2010) | Research | Predict data entry
26 | Grammarly | Tool | Correction of syntactic and semantic errors in text
2 | Expert 2 | Expert | Detect material data from pictures
5 | Sarawagi et al (2001) | Research | Extract data object from text
6 | Knoblock et al (2003) | Research | Learn to extract data from web
9 | Hu et al (2017) | Research | Extract data object from text
18 | Wu et al (2016) | Research | Google's Neural Machine Translation System
29 | bonobo.ai | Tool | Retrieve insights from customer support calls, texts and other interactions
34 | DeepL | Tool | DeepL translation service
8 | Hui Han et al (2003) | Research | Extract metadata from documents
1.2 | Expert 1 | Expert | Data enrichment: Forecast processing time
1.3 | Expert 1 | Expert | Data enrichment: Assign product code from description
3 | Expert 3 | Expert | Predicting the tariff code of a material master
28 | commercetools | Tool | Automatically categorize products with machine learning
32.2 | Reltio | Tool | (2) Machine learning-assisted data enrichment
25 | AX semantics | Tool | Generation of product descriptions
27 | Flixstock | Tool | Generation of catalog pictures using semantic segmentation
10 | Volkovs et al (2014) | Research | Continuous Data Cleaning
15 | Pit-Claudel et al (2016) | Research | Outlier Detection in Heterogeneous Datasets
16 | Yakout et al (2011) | Research | Guided Data Repair
14 | Bhamidipaty and Sarawagi (2002) | Research | Interactive Deduplication using Active Learning
17.2 | Stonebraker et al (2013) | Research | Data curation system (2): entity resolution
31 | Talend | Tool | Using machine learning for data matching
32.1 | Reltio | Tool | (1) Machine learning-assisted data matching
12 | Liu et al (2000) | Research | Database Integration using Neural Networks
13 | Halevy et al (2002) | Research | Learn mappings between ontologies
17.1 | Stonebraker et al (2013) | Research | Data curation system (1): schema integration
21 | Berlin and Motro (2002) | Research | Schema matching using machine learning
22 | Doan et al (2001) | Research | Schema matching using machine learning
23 | Eckert et al (2009) | Research | Improving ontology matching using meta-level learning
24 | Shi et al (2009) | Research | Actively learning ontology matching via user interaction
36 | Octopai | Tool | Discover metadata and trace data across systems (data lineage)
33.1 | Amazon macie | Tool | Automatically discover, classify, and protect sensitive data in AWS
35.2 | pingar Discoveryone | Tool | (2) Detect sensitive data across systems
4.1 | Expert 1 | Expert | Business rules mining (1): extract and cluster rules
11 | Hipp et al (2001) | Research | Data quality mining: Association rules learning
19 | Leser et al (2009) | Research | A Machine Learning Approach to Foreign Key Discovery
35.1 | pingar Discoveryone | Tool | (1) Detect data to retire
4.2 | Expert 1 | Expert | Business rules mining (2): predict interestingness from previous ratings
33.2 | Amazon macie | Tool | Detect unauthorized access and avoid data leak
20 | Fernandez et al (2018) | Research | Linking Datasets using Word Embeddings for Data Discovery
30 | Alation | Tool | Recommendation of table joins
35.3 | pingar Discoveryone | Tool | (3) Discover metadata (named entities, keyphrases, pre-defined categories)
11 typical applications of ML techniques in data management ("archetypes") have been identified
44 cases applying ML techniques in data management:
• 7 from experts
• 16 from tool vendors
• 21 from research
The results will be published in a whitepaper in Q1/2019
Publication in Q1/2019
ML supports all data management phases
Machine learning supports each phase and its roles:
• Acquire & create: data collector, data producer
• Unify & maintain: data custodian
• Protect & retire: data protection officer
• Discover & use: data consumer
Acquire and create:
ML-assisted data creation and enrichment
Roles: data collector, data producer
Data pain points:
• Manual effort for data creation
• Wrong or invalid data entries (typos, blank fields, errors)
Learning:
• Data entry patterns
• Data incidents
• Data extraction patterns
• Data creation patterns
Automation:
• Auto-filling values in forms
• Automatic extraction of data from unstructured data
• Assignment of attributes/values
Benefits:
• First-time-right principle
• Increased efficiency in data entry and creation
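Auto-filling form values from data entry patterns can be illustrated with a very simple frequency model. This is only a sketch under stated assumptions: the field names, the sample records and the `learn_defaults` helper are hypothetical, and real solutions would typically use richer models than a most-frequent-value lookup.

```python
from collections import Counter, defaultdict


def learn_defaults(records, context_field, target_field):
    """For each context value, find the target value that was entered most often."""
    counts = defaultdict(Counter)
    for record in records:
        context = record.get(context_field)
        target = record.get(target_field)
        if context and target:  # skip blank entries
            counts[context][target] += 1
    return {ctx: counter.most_common(1)[0][0] for ctx, counter in counts.items()}


# Made-up historical data entries (hypothetical fields and values).
history = [
    {"country": "DE", "currency": "EUR"},
    {"country": "DE", "currency": "EUR"},
    {"country": "DE", "currency": "USD"},
    {"country": "CH", "currency": "CHF"},
    {"country": "CH", "currency": ""},
]

defaults = learn_defaults(history, "country", "currency")
print(defaults.get("DE"))  # suggested default when the user selects country "DE"
```

A form would offer the learned value as a pre-filled suggestion that the user can still overwrite, which keeps the human in control of the final entry.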
Case study Robert Bosch GmbH:
Assignment of correct customs tariffs supported with ML
Problem:
• Customs tariff numbers ("commodity codes") are a binding legal requirement.
• Consequences of a wrong code can be significant, e.g. delayed customs clearance or severe penalties.
Initial situation:
• Assignment of commodity codes is implemented at Bosch as a globally uniform process.
• A central team of highly qualified experts (Center of Expertise; CoE) assigns the correct codes on GTS within hours, but is limited in its capacity.
• 200,000 classified material masters with 200 different commodity codes for 26 countries and the European Union.
• 70% of Bosch's business units are covered by this global classification process, and an increase in classification requests is expected.
CDQ Good Practice Award 2018 submission:
https://www.cc-cdq.ch/sites/default/files/cdq_award/CDQ%20Good%20Practice%20Award%202018_Robert%20Bosch.pdf
Supervised machine learning learns the relationship
between features and a label.
Training was based on two sets of 11,000 and 50,000 unique
instances, each having about 200 different labels.
The ML solution enables automated assignment of
commodity codes with high accuracy (90%).
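To illustrate what "learning the relationship between features and a label" means, here is a minimal supervised text classifier. It is a sketch, not the Bosch solution: the slides do not name the algorithm or the features, so a small multinomial naive Bayes model over words of a material description is assumed, and the descriptions and the labels `CODE_A` and `CODE_B` are made up.

```python
import math
from collections import Counter, defaultdict


def train(examples):
    """Train a multinomial naive Bayes model on (description, label) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in examples:
        label_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocabulary.add(word)
    return label_counts, word_counts, vocabulary


def predict(model, text):
    """Return the label with the highest log-probability for the description."""
    label_counts, word_counts, vocabulary = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        score = math.log(count / total)  # prior of the label
        denominator = sum(word_counts[label].values()) + len(vocabulary)
        for word in text.lower().split():
            if word in vocabulary:  # ignore words never seen in training
                # Laplace smoothing avoids zero probabilities
                score += math.log((word_counts[label][word] + 1) / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up material descriptions; "CODE_A" and "CODE_B" are placeholder labels.
training_data = [
    ("steel hex bolt m8", "CODE_A"),
    ("steel screw m6 zinc", "CODE_A"),
    ("rubber sealing ring", "CODE_B"),
    ("rubber gasket ring small", "CODE_B"),
]
model = train(training_data)
print(predict(model, "zinc bolt m8"))
```

In a setting like the one described, predictions with low confidence would still be routed to the expert team, while clear cases could be assigned automatically.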
Unify and maintain:
ML-assisted data maintenance and unification
Role: data custodian
Data pain points:
• Correction of data errors (reactive)
• Lacking expert know-how for definition of business rules (proactive)
• Data integration from multiple systems (redundancies, inconsistencies)
Learning:
• Data repairing patterns
• Association rules
• Outlier detection
• Semantic mapping
Automation:
• Reactive: data correction
• Proactive: business rules
• Data unification
Benefits:
• Increased efficiency of (reactive/proactive) data maintenance and unification processes
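Outlier detection, one of the learning items on this slide, can be sketched with a robust statistic. The example below assumes the modified z-score based on the median absolute deviation (MAD) with the commonly used cut-off of 3.5; the slide does not prescribe a method, and the weights are invented.

```python
from statistics import median


def find_outliers(values, threshold=3.5):
    """Flag values whose modified z-score (based on the MAD) exceeds the threshold."""
    center = median(values)
    mad = median(abs(v - center) for v in values)
    if mad == 0:
        return []  # no spread to measure against
    return [v for v in values if 0.6745 * abs(v - center) / mad > threshold]


# Made-up net weights (kg) for one product group; 1200.0 looks like a unit error.
weights = [1.1, 1.2, 1.2, 1.3, 1.25, 1.15, 1200.0]
print(find_outliers(weights))
```

Flagged values are candidates for reactive data correction by the data custodian rather than automatic fixes.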
Case study Tamr:
Data Curation at Scale: The Data Tamer System
The Data Tamer system (later the company Tamr) reduced data curation costs
by 90% in three real-world examples:
Web aggregator:
• Federation of 80,000 URLs, with 13 million records and 200k local attributes in total, into a semantically cohesive collection of facts
Biology application:
• Integration of lab reports from 8,000 biologists and chemists
• Each spreadsheet has about 1 million rows; 100k attribute names in total
Health service:
• Deduplication and aggregation of an integrated database of claim records from 300 insurance carriers
Stonebraker et al. (2013)
Tamr "Technical whitepaper"
Protect and retire:
ML-assisted data protection and retirement
Data protection officer, data pain points:
• No transparency where personally identifiable information (PII) is stored
• Compliance with data protection regulations
Data custodian (data steward, data manager, …), data pain points:
• Retirement of data
Learning:
• PII identifiers
• Fraudulent data access behavior
Automation:
• Identification of sensitive data
• Identification of "end-of-life" data
Benefits:
• Reduced risk
• Increased regulatory compliance
Case Study Corporate Data League:
Identification of natural persons with ML
• The European data privacy regulation (GDPR) forces companies to handle personal information carefully.
• In most supplier and customer databases, some records represent natural persons (e.g. doctors, freelancers, or contacts), and e.g. address data may be considered personal information.
• Our tools are trained with tons of global forenames, surnames, legal forms etc., and are able to identify natural person records.
Verify & enrich
Remark: Our "GDPR Screening" services can also be listed as a "technical and organizational measure" (TOM) for GDPR audits.
Example:
Name | Country | Category
Castillo Karem | PA | Legal Entity
Ali Ahmad Ali BASAHI | YE | Natural Person
Derald Grue | US | Natural Person
EXPERTISES GALTIER | FR | Legal Entity
Jesús García Pastor | ES | Natural Person
Landwirtschaftsbetrieb Thomas Frahm | DE | Legal Entity
UFK po Respublike Karelia | RU | Legal Entity
Sonesson Ronni | SE | Natural Person
Eduardo Alves da Cunha | BR | Natural Person
Shanghai Ju Yang Forging Machine | CN | Legal Entity
Ecopack Bulgaria AD | BG | Legal Entity
Neborachek S.I. | UA | Legal Entity
Alper Ozel | TR | Natural Person
Hassan Ali Shah | PK | Natural Person
Amit Kumar Singh | IN | Natural Person
Rosenborgs Bakeri AS | NO | Legal Entity
ROBERTO RAMOS ANTON | ES | Natural Person
FRANCESCO GIUSEPPE GENOVESE | IT | Natural Person
BORA KECIC AD-specijalni transporti | RS | Legal Entity
"CDQ AG" → 0
"CDQ GmbH" → 0
"Martin Fadler" → 1 (Natural person)
"Tobias Pentek" → 1 (Natural person)
"Microsoft Deutschland" → 0
"Marcus Evans" → 0
Business partner data → convolutional neural network → identified natural persons
The convolutional neural network learns the mapping between input and output.
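The input/output contract of such a classifier (a name goes in, 1 for a natural person or 0 otherwise comes out) can be shown with a far simpler stand-in. The sketch below is not the convolutional neural network named on the slide: it is a lookup baseline, and the tiny `LEGAL_FORMS` and `FORENAMES` sets and the function name are hypothetical.

```python
# Tiny hypothetical vocabularies; a real system would learn from far larger lists.
LEGAL_FORMS = {"ag", "gmbh", "ltd", "inc"}
FORENAMES = {"martin", "tobias", "amit", "alper", "hassan"}


def is_natural_person(name):
    """Return 1 if the name looks like a natural person, else 0."""
    tokens = name.lower().replace(".", " ").split()
    if any(token in LEGAL_FORMS for token in tokens):
        return 0  # a legal-form token marks a legal entity
    return 1 if any(token in FORENAMES for token in tokens) else 0


for name in ["CDQ AG", "CDQ GmbH", "Martin Fadler", "Tobias Pentek"]:
    print(name, "->", is_natural_person(name))
```

A learned model replaces the hand-written lookups with patterns derived from labeled examples, which is what lets it generalize to names that appear in no list.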
Discover and use:
ML-assisted data discovery
Data consumer, e.g.:
• Data scientist
• Data citizen
• Developer
Data pain points:
• Finding and cleaning relevant data
• Identifying relationships between data sets
Learning:
• Data usage
• Semantic mapping
Automation:
• Recommendations
• Linking of datasets
Benefits:
• Increased access to, and use of, data
• Easier interpretability of data
Alation:
Recommend relevant data, e.g. SQL join tables
https://alation.com/product/
• The user writes an SQL query in the data catalog.
• From previous SQL queries, the machine learns which tables are most likely to be joined.
• AI/ML proposes tables to join.
• The user joins the proposed table and in this way finds data much faster.
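Learning join recommendations from previous SQL queries can be approximated by counting which tables appear together. This is a rough sketch and not Alation's implementation: the regular expression only handles plain table names after FROM and JOIN, and the query log, table names and helper functions are hypothetical.

```python
import re
from collections import Counter, defaultdict

# Matches "FROM <table>" and "JOIN <table>" in simple queries (assumption:
# plain table names, no subqueries or quoted identifiers).
TABLE_PATTERN = re.compile(r"\b(?:from|join)\s+([A-Za-z_][\w.]*)", re.IGNORECASE)


def learn_join_counts(queries):
    """Count how often each pair of tables appears together in one query."""
    co_occurrence = defaultdict(Counter)
    for query in queries:
        tables = {name.lower() for name in TABLE_PATTERN.findall(query)}
        for table in tables:
            for other in tables - {table}:
                co_occurrence[table][other] += 1
    return co_occurrence


def recommend_joins(co_occurrence, table, top_n=2):
    """Return the tables most often joined with the given table."""
    return [name for name, _ in co_occurrence[table.lower()].most_common(top_n)]


# Made-up query log with hypothetical table names.
query_log = [
    "SELECT * FROM orders JOIN customers ON orders.customer_id = customers.id",
    "SELECT * FROM orders JOIN customers ON orders.customer_id = customers.id",
    "SELECT * FROM orders JOIN products ON orders.product_id = products.id",
    "SELECT * FROM customers JOIN addresses ON customers.id = addresses.customer_id",
]
counts = learn_join_counts(query_log)
print(recommend_joins(counts, "orders"))
```

The more queries the catalog has seen, the better such counts reflect how tables are actually used together.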
Until now, overall 11 archetypes of ML techniques for data
management have been identified
Process, archetype, case IDs and DQ impact:
Acquire and create
• Support the manual data entry: default values and data repair by data entry (data cleaning). Case IDs: 1.1, 7, 26. DQ impact: intrinsic and contextual.
• Extract data from unstructured data: data object extraction from unstructured data, text data translation and metadata extraction to create data automatically (data transformation and metadata discovery). Case IDs: 2, 5, 6, 9, 18, 29, 34. DQ impact: intrinsic and contextual.
• Create data automatically: data enrichment and generation to automate further processing and manual creation activities (data enrichment and generation). Case IDs: 1.2, 1.3, 3, 28, 32.2, 25, 27. DQ impact: contextual and representational.
Unify and maintain
• Guide data cleaning process: correct errors and inconsistencies with active learning (data cleaning, entity resolution). Case IDs: 10, 14, 15, 16, 17.2, 32.1. DQ impact: intrinsic and contextual.
• Learn rules from data: extract and update the set of quality and business rules to avoid data errors (data and metadata profiling). Case IDs: 4.1, 11. DQ impact: representational.
• Interpret semantics for data integration: semantic integration of data tables and ontologies, also in an interactive manner with active learning (data integration). Case IDs: 12, 13, 17.1, 21, 22, 23, 24. DQ impact: representational and accessibility.
Protect and retire
• Detect data across systems: find and trace sensitive data and data that needs to be retired across systems (metadata or schema discovery, data provenance). Case IDs: 36, 33.1, 35.1, 35.2. DQ impact: representational and accessibility.
• Detect suspicious data usage: detect unauthorized access and data leaks from data access patterns (data monitoring). Case IDs: 33.2. DQ impact: representational and accessibility.
Discover and use
• Provide context data (data enrichment). Case IDs: 4.2, 35.3. DQ impact: representational.
• Interpret semantics to link data sets: semantic integration of data (data integration and data provenance). Case IDs: 20. DQ impact: contextual, representational and accessibility.
• Recommend relevant data (data integration). Case IDs: 30. DQ impact: contextual and accessibility.
The evaluation clearly shows the potential, but also the
early stage of AI in data management
Results of survey: 8 participants, CC CDQ workshop 26.09.2018
[Figure: usage scenarios plotted by utility and implementation; candidates with high potential highlighted]
Likert scale values:
• Implementation: 1 (Not started), 2 (Evaluating), 3 (Prototyping), 4 (Project), 5 (Operational)
• Utility: 1 (Very low), 2 (Low), 3 (Moderate), 4 (High), 5 (Very high)
Most of the identified usage scenarios seem to have clear relevance and a
utility of high to very high. Still, all of the scenarios are at the beginning of
their investigation. Only one scenario made it to a prototype.
Work in progress
Likert scale values:
• Implementation: 1 (Not started), 2 (Evaluating), 3 (Prototyping), 4 (Project), 5 (Operational)
• Utility: 1 (Very low), 2 (Low), 3 (Moderate), 4 (High), 5 (Very high)
No. | Usage scenario | Implementation | Utility | Difference
Acquire and create data
2.5.1a | Auto-fill and typo correction by data entry | 2.63 | 4.14 | 1.52
2.5.1b | Data object detection in unstructured data | 1.88 | 3.50 | 1.63
2.5.1c | Text data translation and generation | 1.00 | 3.33 | 2.33
2.5.1d | Chatbots for data entry | 1.50 | 2.86 | 1.36
Unify and maintain data
2.5.2a | Detection and correction of errors and inconsistencies | 1.43 | 4.57 | 3.14
2.5.2b | Rules validation and extension | 2.14 | 4.00 | 1.86
2.5.2c | Data matching | 2.00 | 4.14 | 2.14
2.5.2d | Data enrichment | 1.75 | 4.14 | 2.39
Protect and retire data
2.5.3a | Sensitive information detection | 1.50 | 3.71 | 2.21
2.5.3b | Identification of data that needs to be retired | 1.50 | 3.57 | 2.07
Discover and use data
2.5.4a | Keyword-based and semantic search | 1.50 | 3.43 | 1.93
2.5.4b | Chatbot to find data | 1.00 | 3.00 | 2.00
2.5.4c | Semantic integration of data from heterogeneous sources | 1.13 | 3.14 | 2.02
2.5.4d | Recommendation of data | 1.75 | 2.71 | 0.96
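The Difference column appears to be utility minus implementation, so a large value marks a scenario that is rated useful but barely implemented. The sketch below recomputes it for four rows of the table and ranks them; the `gap` helper is hypothetical. A few other rows in the table differ from this calculation by 0.01, presumably because the original differences were computed before rounding.

```python
# (implementation, utility) mean scores copied from four rows of the survey table.
scores = {
    "2.5.2a Detection and correction of errors and inconsistencies": (1.43, 4.57),
    "2.5.2d Data enrichment": (1.75, 4.14),
    "2.5.1c Text data translation and generation": (1.00, 3.33),
    "2.5.4d Recommendation of data": (1.75, 2.71),
}


def gap(implementation, utility):
    """Utility minus implementation, rounded to two decimals."""
    return round(utility - implementation, 2)


# Rank scenarios by the gap between perceived utility and implementation status.
ranked = sorted(scores, key=lambda name: gap(*scores[name]), reverse=True)
for name in ranked:
    print(name, gap(*scores[name]))
```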
Expert survey "Machine learning techniques for data
management: implementation status and future potential"
If you would like to participate in the expert survey and
receive the results, please click here.
Learning and implications for data management
Findings
1. Machine learning has significant potential to improve data quality, but will at
the same time disrupt how data is managed
2. Highly repetitive and simple cases will be automated by machines, but humans
need to intervene in more difficult and complex cases
3. Redesign of work processes is required:
- The machine takes over prediction
- The human judges the output and confirms
4. New roles and skills are required
ML in data management:
Outlook on CC CDQ research activities in 2019
Research topics:
• Impact of AI/ML on shared service centers and data management processes
• Prototypical implementation of scenarios with high potential
Research activities:
• Ongoing screening of ML techniques
• Update on taxonomy and archetypes
• Survey on utility and future potential of ML techniques
Idea:
• Provide benchmark datasets for research
Please reach out to us if you have any further questions or
topics to discuss
Martin Fadler (Ph.D. cand.)
Research associate
Competence Center Corporate Data Quality
martin.fadler@cdq.ch
+41 78 405 16 80
Tobias Pentek
Head of Community and Innovation
Competence Center Corporate Data Quality
tobias.pentek@cdq.ch
CDQ AG: www.cdq.ch
CC CDQ Portal: www.cc-cdq.ch
CDQ Academy: www.cdq.iwi.unisg.ch
CC CDQ Community at XING: www.xing.com/net/cdqm
CDQ at Twitter: https://twitter.com/cdq_ag
CC CDQ Community at LinkedIn: https://www.linkedin.com/groups/8137247
Q&A
Thank you! Questions?