These notes discuss the related topics of Data Profiling, Data Catalogs and Metadata Harmonisation. They describe a detailed structure for data profiling activities and identify various open source and commercial tools and data profiling algorithms. Data profiling is a necessary prerequisite to constructing a data catalog, which makes an organisation's data more discoverable. The data collected during data profiling forms the metadata contained in the data catalog and assists with ensuring data quality. Data profiling is also a necessary activity for Master Data Management initiatives. The notes also describe a metadata structure and provide details on metadata standards and sources.
Data Profiling, Data Catalogs and Metadata Harmonisation
1. Data Profiling, Data Catalogs and Metadata Harmonisation
Alan McSweeney
http://ie.linkedin.com/in/alanmcsweeney
https://www.amazon.com/dp/1797567616
2. Data Profiling, Data Catalogs and Metadata
Harmonisation
May 3, 2021 2
Data Profiling – Understand Your Data
Data Catalog – Database of Data Assets
Metadata Harmonisation – Standardisation of Data Descriptions
3. Data Profiling
• Preparing data into a usable and analysable format can consume up to 80% of the resources of a data project
4. Data Profiling
• Process for discovering and examining the data available in existing
data sources
• Essential initial activity
− Understand the structure and contents of data
− Evaluate data quality and data conformance with standards
− Identify terms and metadata used to describe data
− Identify data relationships and dependencies
− Enable the creation of a master data view across all data sources
− Understand and define data integration requirements
• Be able to understand data issues, problems and challenges at the start of a data project:
− Data cleansing
− Data analytics
− Master data management
− Data catalog
− Data migration
5. Data Profiling – Wider Context
[Diagram: (1) Source System feeds (2) Data Profiling, which helps understand system data structures and values. (3) Profiling assists with building the long-term data model held in the Common Data Model, Data Storage and Access Platform. (4) Profiling assists with data extraction and integration definition for Data Integration. (5) Profiling assists with building the Data Dictionary/Catalog to enable data subject access and data discovery. (6) The Data Catalog is an enabler of the Data Virtualisation Layer. Visualisation and Reporting, Analysis and Data Access are served through Common Data Integration.]
6. Importance Of Data Profiling
• Data profiling is a central activity that is key to downstream and long-term data usability and has an impact on Data Quality, Data Lineage/Data Provenance and Master Data Management
− Data Quality: profiling contributes to and ensures data quality
− Data Lineage/Data Provenance: profiling allows tracking of data lineage
• Data lineage and data provenance involve tracking data origins, what happens to the data and how it flows between systems over time
• Data lineage provides data visibility
• It simplifies tracing data errors that may occur in data reporting, visualisation and analytics
− Master Data Management: profiling enables the implementation of Master Data Management
7. Data Profiling Toolset Options – Partial List
• Large number of data
profiling tool options
• You can investigate
these tools to
understand which are
the most suitable and
which functions are
important prior to any
formal tool selection
process
− Download and use trial
versions of commercial
tools
• This work will require
resources
Free/Open Source Tools
Aggregate Profiler
Quadient DataCleaner
Talend Open Studio
Commercial Tools
Atlan
Collibra Data Stewardship Manager
IBM InfoSphere Information Analyser
Informatica Data Explorer
Melissa Data Profiler
Oracle Enterprise Data Quality
SAP Business Objects Data Services (BODS)
SAS DataFlux
TIBCO Clarity
WinPure
8. Data Profiling Stages
Data Access and
Retrieval
•Defining the data sources
to be profiled
Performing the
Profiling
•Working through the
programme of data
profiling activities
Understanding
and Interpreting
the Results
•Collating, documenting
and using the results
9. Layers Of Data Profiling Activities
• Profiling starts with individual data fields/columns and then extends outwards to tables/files, then data stores/databases, to the upstream data sources and downstream data targets and finally the entire set of organisation data entities
[Diagram layers, innermost to outermost: Individual Data Fields; Individual Data Structures (Tables); Data Store; Data Sources and Targets; Organisation Data Landscape]
10. Data Profiling Across The Organisation Data Landscape
[Diagram: multiple interconnected Data Profiling Activities spanning the organisation data landscape]
11. Data Profiling Across The Organisation Data
Landscape
• Data profiling activities are normally performed on single data entities
• The organisation data landscape consists of multiple, generally heterogeneous, loosely interconnected data entities between which data moves
• Data breathes life into the organisation's solution landscape
• One profiled data entity can take its data from a number of upstream data sources and in turn be the source for a number of downstream data targets
• Profiling may involve tracing data lineage across a number of data entities to create an end-to-end view of data provenance
12. Individual Data Profiling Exercise Can Leak Into Other Data Domains
[Diagram: a Core Data Profiling Activity connected to surrounding Data Profiling Activities, with Upstream Data Profiling Activities feeding in and Downstream Data Profiling Activities flowing out]
13. Data Profiling Activities
• Individual Field Analysis
− Data Type, Length, Input Validation, Constraints
− Number and Count of Values, Null/Missing Values, Maxima, Minima, Ranges, Distributions
− Data Categories, Values, Data Dictionaries, Reference Sources
− Data Value Patterns
• Data Structures
− Data Aggregations
− Keys
− Data Indexes
− Triggers
• Inter Field Linkages, Relationships, Correlations and Dependencies
− Unique Combinations of Field Values
− Functional Field Dependencies
− Inclusion Dependencies
− Cross Field Inconsistent Values
• Data Completeness, Consistency and Accuracy
− Missing and Incomplete Series Values and Gaps
− Inconsistent Data Values
− Inaccurate Data Values
− Duplicate Values
− Distribution and Occurrence Checking
• Data Context
− Data Sources
− Data Processing and Transformation, Business Rules
− Data Description and Documentation
− Metadata Definition and Creation
− Data Targets and Usage
− Data Criticality
− Data Security
• Data Statistics
− Data Capacity Statistics
− Data Usage Statistics
− Data Update Statistics
− Data Growth Statistics
− Data Processing Statistics
− Data Overheads
− Data Audit Logging
• Data Infrastructure
− Data Storage Infrastructure
− Data Locations
− Data Processing Infrastructure
• Data Operations
− Backup and Recovery
− Replication
− Availability and Continuity
− Data Maintenance and Housekeeping Activities
− Service Levels
− Data Incident History
• Data Technologies
− Data Integration
− Data Storage
− Data Access
• Problem Identification and Remediation
− Identify Data Problems
− Identify Remediation Activities
14. Data Profiling Activities
• This represents a set of data profiling tasks to create a
complete view of the data contents of a data entity
• This allows a realistic programme of work to be defined to complete the data profiling activity
− Resource requirements can be quantified
− Duration can be estimated
− Informed decision can be made on what activities to include or
exclude
15. Data Profiling – Individual Field Analysis
• Analyse individual data fields or columns
− Data Type, Length, Input Validation, Constraints
• Classify the field formats
− Number and Count of Values, Null/Missing Values, Maxima, Minima, Ranges, Distributions
• Analyse and document the field values and determine any errors and inconsistencies
− Data Categories, Values, Data Dictionaries, Reference Sources
• Identify lists of values used in fields and their sources
− Data Value Patterns
• Seek to identify patterns in field values
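The individual field checks above (counts, nulls, distinct values, ranges and value patterns) can be sketched in a few lines of Python. This is an illustrative sketch only: the sample values and the pattern convention (digits generalised to 9, letters to A) are assumptions, not the behaviour of any particular profiling tool.

```python
import re
from collections import Counter

def profile_column(values):
    """Profile a single field: counts, nulls, numeric range and value patterns."""
    non_null = [v for v in values if v not in (None, "", "NULL")]
    numeric = []
    for v in non_null:
        try:
            numeric.append(float(v))
        except (TypeError, ValueError):
            pass
    def pattern(v):
        # Generalise each value: digits become 9, letters become A
        return re.sub(r"[A-Za-z]", "A", re.sub(r"\d", "9", str(v)))
    return {
        "count": len(values),
        "nulls": len(values) - len(non_null),
        "distinct": len(set(non_null)),
        "min": min(numeric) if numeric else None,
        "max": max(numeric) if numeric else None,
        "top_patterns": Counter(pattern(v) for v in non_null).most_common(3),
    }

# Example: profiling a column of (partly missing) code-like values
stats = profile_column(["AB-123", "CD-456", None, "EF-789", "XY-12"])
```

The pattern summary quickly surfaces format inconsistencies (here, one value has a shorter numeric part than the rest).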
16. Data Profiling – Data Structures
• Analyse data structures – tables or files
− Data Aggregations
• Analyse data structures – number of fields/columns, frequencies of values
across lines/rows
− Keys
• Identify data structure keys, their values, frequencies, relevance and
usefulness for data access
− Data Indexes
• Analyse data structure indexes, their values and their usefulness for data
retrieval
− Triggers
• Determine if triggers have been defined for fields and analyse their
purpose, frequency, efficiency and utility
17. Data Profiling – Inter Field Linkages, Relationships,
Correlations and Dependencies
• Identify relationships between fields/columns of data
structures/tables
• Relationship and dependency identification can be complex
because of data volumes and large number of data values and
combinations
− Unique Combinations of Field Values
• Identify combinations of fields/columns that uniquely identify lines/rows
− Functional Field Dependencies
• Identify circumstances where one field/column value affects others
− Inclusion Dependencies
• Identify where some field/column values are contained in others (such as
foreign keys)
− Cross Field Inconsistent Values
• Identify field/column values across separate data structures that are
inconsistent
18. Data Profiling – Inter Field Linkages, Relationships,
Correlations and Dependencies
− Unique Combinations of Field
Values
• DUCC
• GORDIAN
• HCA
• HyUCC
• SWAN
− Functional Field Dependencies
• DEP-MINER
• DFD
• FDEP
• FDMINE
• FASTFDs
• FUN
• HyFD
− Inclusion Dependencies
• B&B
• BINDER
• CLIM
• DeMARCH
• MIND
• MIND2
• S-INDD
• SPIDER
• ZIGZAG
• There are many algorithms that can be used to simplify the activity
of identifying cross-field dependencies
• These are frequently included in data profiling tools
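For intuition, a naive brute-force version of unique column combination discovery can be written directly. The pruning used by algorithms such as DUCC or HyUCC is far more sophisticated; this sketch only skips supersets of combinations already found to be unique, and the sample table is invented.

```python
from itertools import combinations

def unique_column_combinations(rows, columns):
    """Find minimal column combinations whose values uniquely identify
    each row. Exhaustive search: only practical for small tables."""
    found = []
    for size in range(1, len(columns) + 1):
        for combo in combinations(columns, size):
            # Skip supersets of combinations already known to be unique
            if any(set(combo) >= set(f) for f in found):
                continue
            projection = [tuple(row[c] for c in combo) for row in rows]
            if len(set(projection)) == len(rows):
                found.append(combo)
    return found

rows = [
    {"dept": "HR", "emp": 1, "name": "Ann"},
    {"dept": "HR", "emp": 2, "name": "Bob"},
    {"dept": "IT", "emp": 1, "name": "Ann"},
]
keys = unique_column_combinations(rows, ["dept", "emp", "name"])
```

Here neither column alone identifies a row, but ("dept", "emp") and ("dept", "name") both do, which is exactly the candidate-key information a profiler reports.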
19. Data Profiling – Data Completeness, Consistency and
Accuracy
• Analyse data within data structures to identify any gaps
and inaccuracies
− Missing and Incomplete Series Values and Gaps
• Determine any missing values in data series
− Inconsistent Data Values
• Examine data values for inconsistencies
− Inaccurate Data Values
• Examine data values for inaccuracy
− Duplicate Values
• Identify potential duplicate values
− Distribution and Occurrence Checking
• Create and analyse data value distributions
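Gap and duplicate checks of the kind listed above can be expressed compactly; the field names and sample values are assumptions for illustration.

```python
from collections import Counter

def find_series_gaps(values):
    """Find missing values in what should be a contiguous integer series,
    e.g. invoice or sequence numbers."""
    present = set(values)
    return [v for v in range(min(present), max(present) + 1) if v not in present]

def find_duplicate_keys(rows, key_fields):
    """Flag key values that occur more than once."""
    counts = Counter(tuple(r[f] for f in key_fields) for r in rows)
    return [k for k, n in counts.items() if n > 1]

gaps = find_series_gaps([1, 2, 4, 7])
dups = find_duplicate_keys([{"id": 1}, {"id": 2}, {"id": 1}], ["id"])
```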
20. Data Profiling – Data Context
• Analyse the wider context of the data being profiled
− Data Sources
• Identify the sources of the data (and their sources)
− Data Processing and Transformation, Business Rules
• Determine how the data is created from its sources
− Data Description and Documentation
• Describe and document the data
− Metadata Definition and Creation
• Identify any existing metadata and create/update
− Data Targets and Usage
• Identify how the data is used by downstream targets and activities
− Data Criticality
• Identify the importance and criticality of the data to business operations
− Data Security
• Identify the current and required data security and access control requirements
21. Data Profiling – Data Statistics
• Collect and analyse statistics on data
− Data Capacity Statistics
• Collect and analyse the volumes of data being stored in the structures within the data
entity
− Data Usage Statistics
• Collect and analyse the rate of usage of data
− Data Update Statistics
• Collect and analyse the rate, frequency and extent of data changes
− Data Growth Statistics
• Collect and analyse the current and projected rates of growth of data volumes and data
usage
− Data Processing Statistics
• Collect and analyse data processing statistics – time to update
− Data Overheads
• Collect and analyse data resource overheads associated with activities such as indexes
and log shipping
− Data Audit Logging
• Collect and analyse details on logging configuration and on data activity and usage data
22. Data Profiling – Data Infrastructure
• Analyse the underlying data infrastructure including data
service providers
− Data Storage Infrastructure
• Document the current data storage infrastructure and platforms
− Data Locations
• Document the data storage locations
− Data Processing Infrastructure
• Document the infrastructure and platforms used to process data including
any performance and throughput bottlenecks
23. Data Profiling – Data Operations
• Analyse current data operations activities and processes and
technologies being used
− Backup and Recovery
• Document data entity backup and recovery including any testing and validation of
processes
− Replication
• Document data entity replication to other locations including any testing and validation
of processes
− Availability and Continuity
• Document actual and desired data availability and continuity of access
− Data Maintenance and Housekeeping Activities
• Document processes and activities relating to the maintenance and housekeeping of the
data entity
− Service Levels
• Document actual and desired data service levels across access and usage
− Data Incident History
• Analyse service and incident history relating to the data entity including frequency,
severity, impact and time to resolve and the impact on overall data reliability
24. Data Profiling – Data Technologies
• Analyse the technologies in use for the data being profiled
− Data Integration
• Document and analyse data integration technologies
− Data Storage
• Document and analyse data storage technologies
− Data Access
• Document and analyse data access technologies
25. Data Profiling – Problem Identification and
Remediation
• Collate information on any problems and issues identified
during the data profiling activities
− Identify Data Problems
• Document and analyse the problems and issues
− Identify Remediation Activities
• Identify remediation activities and define programme of work
26. Data Profiling Complexity
• Do not underestimate the complexity, effort and resources
required for data profiling
• A product can make the task easier but it is not a panacea
• Data profiling can be a continuous activity as data changes and the target data catalog needs to be maintained and updated
27. Data Catalog
• Set of information (metadata) containing details on organisation
information resources - datasets
• Data catalog can be static or semi-static data structure created and
maintained manually
• Metadata is structured, consistent and indexed for fast and easy access
and use
• Contains descriptions of data resources
• Enables user self-service data discovery and usage
• Provides data discovery tools and facilities
• Data catalog assists with implementing FAIR (Findable, Accessible,
Interoperable, Reusable) data
− Findable – details on data available on specific topics and subjects can be found
easily and quickly
− Accessible – underlying data can be accessed
− Interoperable – metadata ensures data can be aggregated and integrated across data types
− Reusable – detailed metadata ensures data can be reused in the future
28. FAIR (Findable, Accessible, Interoperable, Reusable)
• https://fairsharing.org/ - sample data collections
• https://www.go-fair.org/ - implementation of FAIR data
principles - https://www.go-fair.org/fair-principles/
• https://www.schema.org/ - contains sample metadata
schemas
• Strong academic focus but the principles can be applied elsewhere
29. Data Catalog Functionality Complexity
• Registry – simple registry of data sources with links to their location and access mechanisms
• Metadata Content – contains descriptions of the contents of the data sources
• Structured and Processable Metadata – metadata is held in a structured and queryable format
• Data Relationships – holds details on metadata and data concepts/themes with relationships between data sources
• Content and Meaning Relationships – semantic mappings (visual representation of linkages) and relationships among domains of different datasets
• Data catalogs can be simple or complex
• Greater complexity requires more effort and the use of tools
• Greater complexity ensures greater data usability and usefulness
• Catalog can be constructed (semi) automatically using data profiling tools
• The data catalog must be constantly updated as data changes
30. Data Catalogs, Master Data Management, Data Profiling And Data Quality Relationships
• Data Catalog – structured information about data sources, contents and access methods
• Master Data Management – layer above operational systems dynamically linking data together
• Data Profiling – discovery and documentation of data sources, types, dictionaries, values, relationships, usage
• Data Quality – defining, monitoring and improving data quality, accuracy, cleansing, consistency and fitness for use
• Relationships: Data Profiling is necessary to build a Data Catalog; MDM operationalises the Data Catalog; MDM ensures Data Quality; Data Quality underpins the Data Catalog; MDM tools can automate Data Profiling
31. Data Catalog Vocabulary (DCAT)
• See https://www.w3.org/TR/vocab-dcat-2/
• Resource Description Framework (RDF) metadata data
model
• DCAT is a standard for describing datasets in a data catalog
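A minimal DCAT-style dataset description can be expressed as JSON-LD. The dataset title, keywords and distribution below are invented for illustration; the dcat: and dct: namespace URIs are the standard DCAT and Dublin Core ones.

```python
import json

# A minimal DCAT dataset description expressed as JSON-LD.
# The dataset details are invented; only the vocabulary terms are standard.
catalog_entry = {
    "@context": {
        "dcat": "http://www.w3.org/ns/dcat#",
        "dct": "http://purl.org/dc/terms/",
    },
    "@type": "dcat:Dataset",
    "dct:title": "Example Customer Dataset",
    "dct:description": "Illustrative entry in a data catalog",
    "dcat:keyword": ["customer", "example"],
    "dcat:distribution": {
        "@type": "dcat:Distribution",
        "dcat:mediaType": "text/csv",
    },
}
serialised = json.dumps(catalog_entry, indent=2)
```

Because the description is structured and uses shared vocabulary terms, entries from different catalogs can be aggregated and queried consistently.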
32. Related Concepts
• Business Glossary – defines terms and concepts across a
business domain providing an authoritative source for
business operations
• Data Dictionary – collection of names, definitions and
attributes about data items that are being used or stored
in a database
33. Data Catalog Tools
• Many commercial data catalog tools – many overlap with
master data management
• Open source options
− CKAN - https://ckan.org/
− Dataverse - https://dataverse.org/
− Invenio - https://inveniosoftware.org/
− QuiltData - https://quiltdata.com/
− Zenodo - https://zenodo.org/
− Kylo - https://kylo.io/
• Can use these to test the concept before investing in commercial tools
• Can also use trial version of Azure Data Catalog -
https://docs.microsoft.com/en-us/azure/data-
catalog/overview
34. Metadata
• Data that provides information about other data resources, enabling relevant data to be discovered, understood and managed reliably and consistently
• There are various classifications of metadata types
35. Possible Metadata Structure And Organisation
Types of Metadata
• Descriptive
− Information about the data resource contained in a set of metadata fields
− Language
− How data can be discovered
• Business
− What the data is, its sources, meaning and relationships with other data
− Location
− Ownership, Authorship
• Structural
− How the data is organised and how versions are maintained
− Formats, contents, dictionaries
• Administrative/Process
− How the data should be managed and administered through its lifecycle stages
− Who can perform what operations on the metadata
− Security and access restrictions and rights
− Data preservation and retention
− Legal constraints and compliance requirements
• Statistical
− Information on actual data creation and usage and other volumetrics
• Reference
− Sets of values for structured metadata fields
• Content
− Automatically generated (unstructured) metadata from content
• Technical
− Infrastructural requirements
− Exchange and interface requirements, interoperability
− API requirements and usage
36. Metadata Harmonisation
• Metadata Harmonisation can mean:
1. The ability of interacting data systems to exchange their individual sets of metadata (that may comply with different metadata standards/approaches/schemas) and to consistently and coherently interpret and understand the exchanged metadata
2. The conversion of existing metadata held in different
systems to a common standard
• Harmonised metadata makes finding and comparing
information easier
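The second meaning of harmonisation, converting metadata held in different systems to a common standard, can be sketched as simple field crosswalks. The source systems, field names and target schema here are assumptions for illustration.

```python
# Crosswalks mapping each source system's metadata fields to a common
# target schema (all names are illustrative assumptions).
CROSSWALKS = {
    "system_a": {"Title": "title", "Creator": "owner", "Subject": "keywords"},
    "system_b": {"name": "title", "author": "owner", "tags": "keywords"},
}

def harmonise(record, source_system):
    """Convert a source metadata record to the common target schema."""
    mapping = CROSSWALKS[source_system]
    return {target: record[source]
            for source, target in mapping.items() if source in record}

a = harmonise({"Title": "Sales 2020", "Creator": "Finance"}, "system_a")
b = harmonise({"name": "Sales 2020", "author": "Finance"}, "system_b")
```

After harmonisation the two records, originally held under different schemas, compare field-for-field, which is what makes finding and comparing information easier.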
37. Key Metadata Harmonisation Principles
• Evaluation – Source, target metadata structures/schemas and
the underlying data should be profiled before any target
metadata schema design work starts
• Matching – Match existing metadata structures involving
extraction and analysis of data from source systems
• Transformation – Map the source schemas and geometry to
the common target schema
• Validation – Assess the conformance of metadata
• Publication – Make transformed metadata schema available
• Management – Ongoing management, administration and
maintenance
38. Metadata Concerns
• No consistent schema and nomenclature being used
• Each system will maintain different sets of metadata
• No consistent set of values (vocabulary/dictionary/code
lists) for metadata fields
• Difficult to perform reliable comparisons across metadata
40. Scope Of Wider Data Management
Data Management
• Data Governance
• Data Architecture Management
• Data Development
• Data Operations Management
• Data Security Management
• Data Quality Management
• Data Integration Management
• Reference and Master Data Management
• Data Warehousing and Business Intelligence Management
• Document and Content Management
• Metadata Management
41. Reference And Master Data Management
• Reference and Master Data Management is the ongoing
reconciliation and maintenance of reference data and master data
− Reference Data Management is control over defined domain values (also
known as vocabularies), including control over standardised terms, code values
and other unique identifiers, business definitions for each value, business
relationships within and across domain value lists, and the consistent, shared
use of accurate, timely and relevant reference data values to classify and
categorise data
− Master Data Management is control over master data values to enable
consistent, shared, contextual use across systems, of the most accurate,
timely, and relevant version of truth about essential business entities
• Reference data and master data provide the context for transaction
data
42. Reference and Master Data Management –
Definition and Goals
• Definition
− Planning, implementation, and control activities to ensure
consistency with a golden version of contextual data values
• Goals
− Provide authoritative source of reconciled, high-quality master
and reference data
− Lower cost and complexity through reuse and leverage of
standards
− Support business intelligence and information integration efforts
43. Reference and Master Data Management
Inputs: Business Drivers; Data Requirements; Policy and Regulations; Standards; Code Sets; Master Data; Transactional Data
Suppliers: Steering Committees; Business Data Stewards; Subject Matter Experts; Data Consumers; Standards Organisations; Data Providers
Tools: Reference Data Management Applications; Master Data Management Applications; Data Modeling Tools; Process Modeling Tools; Metadata Repositories; Data Profiling Tools; Data Cleansing Tools; Data Integration Tools; Business Process and Rule Engines; Change Management Tools
Participants: Data Stewards; Subject Matter Experts; Data Architects; Data Analysts; Application Architects; Data Governance Council; Data Providers; Other IT Professionals
Primary Deliverables: Master and Reference Data Requirements; Data Models and Documentation; Reliable Reference and Master Data; Golden Record Data Lineage; Data Quality Metrics and Reports; Data Cleansing Services
Metrics: Reference and Master Data Quality; Change Activity; Issues, Costs, Volume; Use and Re-Use; Availability; Data Steward Coverage
Consumers: Application Users; BI and Reporting Users; Application Developers and Architects; Data Integration Developers and Architects; BI Developers and Architects; Vendors, Customers, and Partners
44. Reference And Master Data Management –
Principles
• Shared reference and master data belongs to the organisation, not to a
particular application or department
• Reference and master data management is an on-going data quality
improvement program; its goals cannot be achieved by one project alone
• Business data stewards are the authorities accountable for controlling
reference data values. Business data stewards work with data
professionals to improve the quality of reference and master data
• Golden data values represent the organisation’s best efforts at
determining the most accurate, current, and relevant data values for
contextual use. New data may prove earlier assumptions to be false.
Therefore, apply matching rules with caution, and ensure that any changes
that are made are reversible
• Replicate master data values only from the database of record
• Request, communicate, and, in some cases, approve of changes to
reference data values before implementation
45. Reference Data
• Reference data is data used to classify or categorise other
data
• Business rules usually dictate that reference data values
conform to one of several allowed values
• In all organisations, reference data exists in virtually every
database
• Reference tables link via foreign keys into other relational
database tables, and the referential integrity functions
within the database management system ensure only valid
values from the reference tables are used in other tables
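The referential integrity mechanism described above can be demonstrated with SQLite from the Python standard library; the reference table and column names are illustrative. Note that SQLite only enforces foreign keys once the PRAGMA is enabled.

```python
import sqlite3

# Demonstrates a reference table constraining allowed values via a
# foreign key. Table and column names are invented for illustration.
con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # SQLite requires this per connection
con.execute("CREATE TABLE country_ref (code TEXT PRIMARY KEY, name TEXT)")
con.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, "
            "country TEXT REFERENCES country_ref(code))")
con.executemany("INSERT INTO country_ref VALUES (?, ?)",
                [("IE", "Ireland"), ("FR", "France")])
con.execute("INSERT INTO customer VALUES (1, 'IE')")      # valid reference value
try:
    con.execute("INSERT INTO customer VALUES (2, 'XX')")  # not in reference table
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
```

The database management system rejects the second insert because 'XX' is not a value in the reference table, which is exactly the behaviour the slide describes.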
46. Master Data
• Master data is data about the business entities that
provide context for business transactions
• Master data is the authoritative, most accurate data
available about key business entities, used to establish the
context for transactional data
• Master data values are considered golden
• Master Data Management is the process of defining and
maintaining how master data will be created, integrated,
maintained, and used throughout the enterprise
47. Master Data Challenges
• What are the important roles, organisations, places, and things referenced repeatedly?
• What data is describing the same person, organisation, place, or thing?
• Where is this data stored? What is the source for the data?
• Which data is more accurate? Which data source is more reliable and credible? Which data
is most current?
• What data is relevant for specific needs? How do these needs overlap or conflict?
• What data from multiple sources can be integrated to create a more complete view and
provide a more comprehensive understanding of the person, organisation, place or thing?
• What business rules can be established to automate master data quality improvement by
accurately matching and merging data about the same person, organisation, place, or
thing?
• How do we identify and restore data that was inappropriately matched and merged?
• How do we provide our golden data values to other systems across the enterprise?
• How do we identify where and when data other than the golden values is used?
48. Understand Reference And Master Data Integration
Needs
• Reference and master data requirements are relatively easy to
discover and understand for a single application
• Potentially much more difficult to develop an understanding of
these needs across applications, especially across the entire
organisation
• Analysing the root causes of a data quality problem usually
uncovers requirements for reference and master data
integration
• Organisations that have successfully managed reference and
master data typically have focused on one subject area at a
time
− Analyse all occurrences of a few business entities, across all physical
databases and for differing usage patterns
49. Define and Maintain the Data Integration Architecture
• Effective data integration architecture controls the shared access, replication, and
flow of data to ensure data quality and consistency, particularly for reference and
master data
• Without data integration architecture, local reference and master data
management occurs in application silos, inevitably resulting in redundant and
inconsistent data
• The selected data integration architecture should also provide common data
integration services
− Change request processing, including review and approval
− Data quality checks on externally acquired reference and master data
− Consistent application of data quality rules and matching rules
− Consistent patterns of processing
− Consistent metadata about mappings, transformations, programs and jobs
− Consistent audit, error resolution and performance monitoring data
− Consistent approach to replicating data
• Establishing master data standards can be a time consuming task as it may involve
multiple stakeholders
• Apply the same data standards, regardless of integration technology, to enable
effective standardisation, sharing, and distribution of reference and master data
50. Data Integration Services Architecture
[Diagram: data integration services framed by Data Quality Management and Metadata Management, comprising Integration Metadata; Job Flow and Statistics; Data Acquisition, File Management and Audit; Replication Management; Data Standardisation, Cleansing and Matching; Business Metadata; Source Data Archives; Rules; Errors; Staging; Reconciled Master Data; Subscriptions]
51. Implement Reference And Master Data
Management Solutions
• Reference and master data management solutions are
complex
• Given the variety, complexity, and instability of
requirements, no single solution or implementation
project is likely to meet all reference and master data
management needs
• Organisations should expect to implement reference and
master data management solutions iteratively and
incrementally through several related projects and phases
52. Define And Maintain Match Rules
• Matching, merging, and linking of data from multiple systems
about the same person, group, place, or thing is a major master
data management challenge
• Matching attempts to remove redundancy, to improve data
quality, and provide information that is more comprehensive
• Data matching is performed by applying inference rules
− Duplicate identification match rules focus on a specific set of fields that
uniquely identify an entity and identify merge opportunities without
taking automatic action
− Match-merge rules match records and merge the data from these
records into a single, unified, reconciled, and comprehensive record
− Match-link rules identify and cross-reference records that appear to
relate to a master record without updating the content of the cross-
referenced record
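A duplicate-identification match rule of the kind described above (flagging merge candidates without taking automatic action) might look like the following sketch; the normalisation rule, match fields and sample records are assumptions.

```python
import re

def normalise(s):
    """Crude normalisation for matching: lower-case, strip non-alphanumerics."""
    return re.sub(r"[^a-z0-9]", "", s.lower())

def duplicate_candidates(records, fields):
    """Duplicate-identification match rule: report pairs of records whose
    normalised match fields collide, without merging them."""
    seen = {}
    matches = []
    for rec in records:
        key = tuple(normalise(str(rec[f])) for f in fields)
        if key in seen:
            matches.append((seen[key]["id"], rec["id"]))
        else:
            seen[key] = rec
    return matches

records = [
    {"id": 1, "name": "Mary O'Brien", "dob": "1980-01-01"},
    {"id": 2, "name": "MARY OBRIEN",  "dob": "1980-01-01"},
    {"id": 3, "name": "John Smith",   "dob": "1975-05-05"},
]
pairs = duplicate_candidates(records, ["name", "dob"])
```

The rule only identifies the merge opportunity (records 1 and 2); a separate match-merge or match-link rule would decide what, if anything, to do with the pair.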
53. Vocabulary Management And Reference Data
• A vocabulary is a collection of terms / concepts and their
relationships
• Vocabulary management is defining, sourcing, importing,
and maintaining a vocabulary and its associated reference
data
− See ANSI/NISO Z39.19 - Guidelines for the Construction, Format,
and Management of Monolingual Controlled Vocabularies -
http://www.niso.org/kst/reports/standards?step=2&gid=&project
_key=7cc9b583cb5a62e8c15d3099e0bb46bbae9cf38a
• Vocabulary management requires the identification of the
standard list of preferred terms and their synonyms
• Vocabulary management requires data governance,
enabling data stewards to assess stakeholder needs
54. Vocabulary Management And Reference Data
• Key questions to ask to enable vocabulary management
− What information concepts (data attributes) will this vocabulary
support?
− Who is the audience for this vocabulary? What processes do they
support, and what roles do they play?
− Why is the vocabulary needed? Will it support applications, content
management, analytics, and so on?
− Who identifies and approves the preferred vocabulary and vocabulary
terms?
− What are the current vocabularies different groups use to classify this
information? Where are they located? How were they created? Who
are their subject matter experts? Are there any security or privacy
concerns for any of them?
− Are there existing standards that can be leveraged to fulfil this need?
Are there concerns about using an external standard vs. internal? How
frequently is the standard updated and what is the degree of change of
each update? Are standards accessible in an easy-to-import and maintain
format in a cost-efficient manner?
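The core of vocabulary management, a standard list of preferred terms with their synonyms, can be sketched as a simple lookup structure. The terms below are invented examples; a real vocabulary would be sourced and governed as described above (see ANSI/NISO Z39.19).

```python
# Minimal sketch of a controlled vocabulary: preferred terms plus a
# synonym map, so any known variant normalises to the preferred term.
# The country terms are invented examples for illustration.

class Vocabulary:
    def __init__(self):
        self.preferred = set()
        self.synonyms = {}   # lowercased variant -> preferred term

    def add_term(self, preferred, *variants):
        self.preferred.add(preferred)
        self.synonyms[preferred.lower()] = preferred
        for v in variants:
            self.synonyms[v.lower()] = preferred

    def normalise(self, term):
        """Map any known variant to its preferred term; None means
        the term is outside the vocabulary (a data-quality signal)."""
        return self.synonyms.get(term.strip().lower())

vocab = Vocabulary()
vocab.add_term("Ireland", "IE", "IRL", "Republic of Ireland")
vocab.add_term("United Kingdom", "UK", "GB", "Great Britain")

print(vocab.normalise("IRL"))    # -> Ireland
print(vocab.normalise("  uk "))  # -> United Kingdom
print(vocab.normalise("Eire"))   # -> None (not yet in the vocabulary)
```

Unrecognised terms returning `None` is deliberate: data stewards can review them and either map them as new synonyms or reject them, which is the governance loop the slides describe.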
55. Defining Golden Master Data Values
• Golden data values are the data values thought to be the most
accurate, current, and relevant for shared, consistent use
across applications
• Determine golden values by analysing data quality, applying
data quality rules and matching rules, and incorporating data
quality controls into the applications that acquire, create, and
update data
• Establish data quality measurements to set expectations,
measure improvements, and help identify root causes of data
quality problems
• Assess data quality through a combination of data profiling
activities and verification against adherence to business rules
• Once the data is standardised and cleansed, the next step is to
attempt reconciliation of redundant data through application
of matching rules
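A survivorship decision of this kind can be sketched as follows. The source-trust ranking, the quality rule, and the sample candidate records are all assumptions for illustration; real golden-value rules would come from the data quality assessment described above.

```python
# Sketch of a "golden value" survivorship rule: per attribute, keep the
# candidate value from the most trusted, most recently updated source
# that passes a quality check. Rankings and rules are illustrative.

from datetime import date

SOURCE_TRUST = {"crm": 3, "erp": 2, "legacy": 1}   # assumed ranking

def passes_quality(field, value):
    """Very small stand-in for real data quality rules."""
    if value in (None, ""):
        return False
    if field == "email":
        return "@" in value
    return True

def golden_record(candidates, fields):
    """candidates: list of (source, last_updated, record) tuples."""
    golden = {}
    for field in fields:
        best = None
        for source, updated, rec in candidates:
            value = rec.get(field)
            if not passes_quality(field, value):
                continue
            rank = (SOURCE_TRUST.get(source, 0), updated)
            if best is None or rank > best[0]:
                best = (rank, value)
        golden[field] = best[1] if best else None
    return golden

candidates = [
    ("legacy", date(2019, 1, 5), {"email": "a.byrne@example.com", "phone": "01-555-0100"}),
    ("erp",    date(2021, 3, 1), {"email": "invalid-email",       "phone": "01-555-0199"}),
    ("crm",    date(2020, 6, 9), {"email": "ann.byrne@example.com", "phone": ""}),
]

result = golden_record(candidates, ["email", "phone"])
# email survives from crm (highest trust with a valid value);
# phone survives from erp (crm's value fails the quality rule)
```

The point of the sketch is that the golden value is decided attribute by attribute, so the reconciled record can draw different fields from different sources.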
56. Define And Maintain Hierarchies And Affiliations
• Vocabularies and their associated reference data sets are
often more than lists of preferred terms and their
synonyms
• Affiliation management is the establishment and
maintenance of relationships between master data
records
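A hierarchy between master records can be sketched as a parent map over entity identifiers, for example a corporate "rolls up to" hierarchy. The entity names are invented for illustration.

```python
# Small sketch of hierarchy management: parent-child affiliations
# between master data records, e.g. a corporate roll-up hierarchy.
# Entity names are invented examples.

parents = {
    "Acme Ireland": "Acme Europe",
    "Acme UK": "Acme Europe",
    "Acme Europe": "Acme Global",
}

def ancestry(entity):
    """Walk the affiliation chain from an entity up to the root."""
    chain = [entity]
    while entity in parents:
        entity = parents[entity]
        chain.append(entity)
    return chain

print(ancestry("Acme Ireland"))
# -> ['Acme Ireland', 'Acme Europe', 'Acme Global']
```

Maintaining the relationships separately from the records themselves means an affiliation change (for example, an acquisition) updates one link rather than every affected record.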
57. Plan And Implement Integration Of New Data
Sources
• Integrating new reference data sources involves
− Receiving and responding to new data acquisition requests from
different groups
− Performing data quality assessment services using data cleansing
and data profiling tools
− Assessing data integration complexity and cost
− Piloting the acquisition of data and its impact on match rules
− Determining who will be responsible for data quality
− Finalising data quality metrics
58. Replicate And Distribute Reference And Master Data
• Reference and master data may be read directly from a
database of record, or may be replicated from the
database of record to other application databases for
transaction processing, and data warehouses for business
intelligence
• Reference data most commonly appears as pick list values
in applications
• Replication aids maintenance of referential integrity
59. Manage Changes To Reference And Master Data
• Specific individuals have the role of a business data
steward with the authority to create, update, and retire
reference data
• Formally control changes to controlled vocabularies and
their reference data sets
• Carefully assess the impact of reference data changes
60. Data Governance And MDM Success Factors
• Master Data Management will support business by providing a
strategy, governance policies and technologies for customer,
product, and entitlement information by following the Master Data
Management Guiding Principles
− Master data management will use (and where needed create) a “single version
of the truth” for customer, product, and asset entitlement master data
consolidated into a single master data system
− Master data management will establish standard data definitions and consistent
usage to simplify business processes across enterprise systems
− Master data management systems and processes will be flexible and adaptable
to handle domestic and global expansion to support growth in both established
and emerging markets
− Master data management will adhere to a standards governance process to
ensure key data elements are created, maintained, cleansed and converted to
be syndicated across enterprise systems
− Master data management will identify responsibilities and monitor
accountability for customer, product, and entitlement information
− Master data management will facilitate cross-functional collaboration and
manage continuous improvement of master data for customer, product, and
entitlement domains
61. Data Governance is Not A Choice – It Is A Necessity
• “We’ve got to stop having the ‘who owns the data?’
conversation.”
• “We can’t do MDM if we don’t formalise decision-making
processes around our enterprise information.”
• “Fixing the data in a single system is pointless; we don’t
know what the rules are across our systems.”
• “Everyone agrees data quality is poor, but no one can
agree on how to fix it.”
• “Are you kidding? We have multiple versions of the
single-version-of-the-truth.”
62. MDM Program Critical Success Factors
• Strategy
− Drive and promote alignment with corporate strategic initiatives and pillar-specific goals
− Definition of criteria and core attributes that define domains and related objects
• Solution
− Alignment with corporate strategic initiatives and pillar-specific goals
− Identification of “Quick Wins” that have measurable impact
− Clear definition of metrics for measuring data improvement
− Leading industry practices have been incorporated into the solution design
• Governance
− Executive ownership and a governance organisation have been rationalised and
established to address federated data management needs
− Data Quality is addressed at all points of processes, as well as customer and product
lifecycles
• End-to-end Roadmap
− Prioritised program roadmap for “Quick Wins”
− Prioritised program roadmap for CDM strategic initiatives
− Fully vetted cost-benefit analysis (CBA) for each roadmap item
− “No Regrets” actions are rationalised and aligned with the strategic roadmap