InfoSphere Information Analyzer
Data quality assessment, analysis and monitoring
Information Analyzer is an IBM product that is widely used for data profiling. It helps you understand data structure, format, and relationships, and it supports ongoing data quality monitoring. Information Analyzer is also referred to as WebSphere Information Analyzer.
Information Analyzer has extensive data profiling capabilities. It provides a user interface with a set of controls designed to integrate into the development workflow. The four major data profiling
functions within Information Analyzer are:
Column Analysis: Generates a full frequency distribution and examines column values to infer properties and definitions such as statistical measures and domain values (see the sketch after this list).
Primary Key Analysis: Identifies candidate keys for one or more tables and helps you test columns or column combinations to determine whether a candidate is suitable as a primary key.
Foreign Key Analysis: Examines the relationships and contents across tables to identify foreign keys and check referential integrity.
Cross-Domain Analysis: Identifies overlapping values between columns and redundant data across tables.
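As a rough illustration of what column analysis computes, the following Python sketch builds a frequency distribution for one column and infers a few basic properties. It is a conceptual sketch only, not the product's implementation; the column values and property names are invented for the example.

from collections import Counter

def profile_column(values):
    """Toy column analysis: frequency distribution plus a few inferred properties."""
    freq = Counter(values)                      # full frequency distribution
    non_null = [v for v in values if v not in (None, "")]
    return {
        "row_count": len(values),
        "null_count": len(values) - len(non_null),
        "cardinality": len(set(non_null)),      # number of distinct values
        "most_common": freq.most_common(5),     # candidate domain values
        "inferred_numeric": all(str(v).isdigit() for v in non_null),
        "max_length": max((len(str(v)) for v in non_null), default=0),
    }

# Example: a small, hypothetical "division_code" column
print(profile_column(["A01", "A01", "B02", None, "C03", "A01"]))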
IBM InfoSphere Information Analyzer provides data quality assessment, data quality monitoring and
data rule design and analysis capabilities. This software helps you derive more meaning from your
enterprise data, reduces the risk of proliferating incorrect information, facilitates the delivery of trusted
content, and helps to lower data integration costs.
InfoSphere Information Analyzer features include:
• Advanced analysis and monitoring provides source system profiling and analysis capabilities to help
you classify and assess your data.
• Integrated rules analysis uses data quality rules for greater validation, trending and pattern analysis.
• Scalable, collaborative platform enables sharing of information and results across the enterprise.
• Support for heterogeneous data enables you to assess information over a wide range of systems
and data sources.
Methodology and best practices
Use IBM InfoSphere Information Analyzer to understand the content, structure, and overall quality of your
data at a given point in time.
The analysis methodology and best practices provide deeper insight into the analytical methods that
IBM InfoSphere Information Analyzer uses to analyze source data and rules.
The information is organized by analytical function. It gives you both in-depth knowledge and best
practices for:
• Data analysis, including:
o Applying data analysis system functionality
o Applying data analysis techniques within a function
o Interpreting data analysis results
o Making decisions or taking actions based on analytical results
• Data quality analysis and monitoring, including:
o Supporting business-driven rule definition and organization
o Applying and reusing rules consistently across data sources
o Leveraging multi-level rule analysis to understand broader data quality issues
o Evaluating rules against defined benchmarks/thresholds
o Assessing and annotating data quality results
o Monitoring trends in data quality over time
o Deploying rules across environments
o Running ad hoc, scheduled, or command-line executions
Analyzing data by using data rules
The topics in this section describe how to define and execute data rules, which evaluate or validate
specific conditions associated with your data sources. Data rules can be used to extend your data
profiling analysis, to test and evaluate data quality, or to improve your understanding of data integration
requirements.
To work with data rules, start from the Develop navigator menu in the console and select Data Quality.
This opens the starting point for creating and working with data rule functionality.
From the Data Quality workspace you can:
• Create data rule definitions, rule set definitions, data rules, rule sets, and metrics
• Build data rule definition, rule set definition, and metric logic
• Create data rule definition and rule set definition associations
• Associate a data rule definition, rule set definition, metric, data rule, or rule set with folders
• Associate a data rule definition, rule set definition, metric, data rule, or rule set with IBM®
InfoSphere™ Business Glossary terms, policies, and contacts
• Build data rule definitions or rule set definitions by using the rule builder
• Add a data rule definition with the free form editor
Characteristics of data rule functionality
You can use data rules to evaluate and analyze conditions found during data profiling, to conduct a data
quality assessment, to provide more information to a data integration effort, or to establish a
framework for validating and measuring data quality over time.
You can construct data rules in a generic fashion through the use of rule definitions. These definitions
describe the rule evaluation or condition. By associating physical data sources to the definition, a data
rule can be run to return analysis statistics and detail results.
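The separation between a reusable rule definition and a data rule bound to a physical source can be pictured roughly as follows. This is a hedged Python sketch of the idea, not Information Analyzer's actual API; the column names and sample rows are assumptions made for illustration.

def not_null_and_not_empty(value):
    """A generic rule definition: the logical test, with no data source attached."""
    return value is not None and str(value).strip() != ""

def run_data_rule(rule, rows, column):
    """Binding the definition to a physical column yields a runnable data rule
    that returns summary statistics and the failing detail records."""
    failures = [row for row in rows if not rule(row.get(column))]
    stats = {"records": len(rows), "met": len(rows) - len(failures), "not_met": len(failures)}
    return stats, failures

# The same definition can be bound to different sources and columns.
customers = [{"first_name": "Fred"}, {"first_name": ""}, {"first_name": None}]
print(run_data_rule(not_null_and_not_empty, customers, "first_name"))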
Creating a rule definition
Creating a rule definition requires two components: a name for the rule definition and a logical
statement (the rule logic) about what the rule definition tests or evaluates. Incomplete, empty, or
invalid data values affect the quality of the data in your project by interrupting data integration
processes and by using up memory on source systems. You can create rule definitions to analyze data
for completeness and validity to find these anomalies.
You can create a rule definition by defining the name and description of the rule, and by using the free
form editor or rule logic builder to complete the rule logic for the rule definition.
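For example, a completeness-and-validity check of the kind described above could be expressed, in spirit, as follows. This is an illustrative Python sketch, not Information Analyzer's rule logic syntax; the "division code" format is a hypothetical example.

import re

VALID_CODE = re.compile(r"^[A-Z]\d{2}$")   # hypothetical format: one letter, two digits

def division_code_is_valid(value):
    """Rule logic: the value must be present (complete) and match the expected format (valid)."""
    return value is not None and bool(VALID_CODE.match(str(value).strip()))

assert division_code_is_valid("A01")
assert not division_code_is_valid("")      # incomplete
assert not division_code_is_valid("1234")  # invalid format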
Procedure
1. From the Develop icon in the Navigator menu in the console, select Data Quality.
2. Click New Rule Definition in the Tasks list, located on the right side of the screen.
3. Enter a name for the new rule definition in the Name field.
4. Optional: Enter a brief description of your rule definition in the Short Description field.
5. Optional: Enter a longer description of your rule definition in the Long Description field.
6. Optional: In the Validity Benchmark section, check Include Benchmark to set benchmarks to
check the validity of your data.
7. Click Save.
Generating a data rule from a rule definition
After you create rule definition logic, you can create a data rule to analyze real data in your projects.
Procedure
1. From the Develop icon on the Navigator menu, select Data Quality.
2. Highlight the rule definition that you want to generate a data rule from.
3. In the Tasks menu on the right side of the screen, click Generate Data Rule or Rule Set.
4. On the Overview tab, type a name for the data rule. The name must contain at least one
character and cannot contain the slash character. The name must be unique in your project.
5. Optional: Type a short description and long description of your data rule. The Created
By, Created On, and Last Modified fields are automatically populated after you create and save
your data rule. You can optionally provide information in the Owner and Data Steward fields.
6. Decide whether you want to set a validity benchmark for your data rule. Benchmarks quantify the
quality of your data and help you monitor it. Select the Monitor Records Flagged by One or
More Rules check box in the Validity Benchmark box if you want to monitor records that
are flagged by other rules in your project.
7. At the top of the workspace, switch from the Overview tab to the Bindings and Output tab.
8. Click Save to create the data rule.
Setting benchmarks for data rules
You can set a validity benchmark either when you initially create a rule definition or when you generate
the data rule, in order to quantify the quality of your data and to monitor it over time.
Validity benchmark
The validity benchmark establishes the level of tolerance you have for exceptions to the data rule. The
benchmark indicates whether enough records met (or failed to meet) the rule to mark a specific
execution of the rule as passing or failing the benchmark.
Select Monitor records that do not meet one or more rules in the data rule workspace.
You can define the validity benchmark by using the following options, which are available in the menu in
the validity benchmark workspace. Start by selecting one of the following options (a sketch of the
underlying arithmetic follows the descriptions):
% Not Met
Determines the percentage of records that did not meet the rule logic in the data rule. You can
set the benchmark to display a pass or fail condition when this value is greater than, less than,
or equal to a reference value that you specify. For example, to ensure that the percentage of
records that do not meet a data rule never exceeds 10%, you would set the benchmark to
"% Not Met <= 10."
# Not Met
Determines the number of records that did not meet the rule logic in your data rule. You can set
the benchmark to display a pass or fail condition when this value is greater than, less than, or
equal to a reference value that you specify. For example, to ensure that the number of
records that do not meet a data rule never exceeds 1000, you would set the
benchmark to "# Not Met <= 1000."
% Met
Determines the percentage of records that meet the rule logic in your data rule. You can set the
benchmark to display a pass or fail condition when this value is greater than, less than, or equal
to a reference value that you specify. For example, to ensure that the percentage of records that
meet the data rule never falls below 90%, you would set the benchmark to "Met % >= 90."
# Met
Determines the number of records that meet the rule logic in your data rule. You can set the
benchmark to display a pass or fail condition when this value is greater than, less than, or equal
to a reference value that you specify. For example, to ensure that the number of records that
meet the data rule never falls below 9000, you would set the benchmark to "Met # >= 9000."
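All four options boil down to comparing a count or percentage of qualifying records against a reference value. The following minimal Python sketch shows that arithmetic; the benchmark tuples and thresholds are assumptions for illustration, not product code.

def evaluate_benchmark(met, total, benchmark):
    """Return True (pass) or False (fail) for one rule execution.
    benchmark examples: ("% Met", ">=", 90) or ("# Not Met", "<=", 1000)."""
    measure, op, reference = benchmark
    not_met = total - met
    value = {
        "% Met": 100.0 * met / total,
        "% Not Met": 100.0 * not_met / total,
        "# Met": met,
        "# Not Met": not_met,
    }[measure]
    return {">=": value >= reference, "<=": value <= reference, "=": value == reference}[op]

# 9500 of 10000 records met the rule: passes "% Met >= 90" and "# Not Met <= 1000".
print(evaluate_benchmark(9500, 10000, ("% Met", ">=", 90)))
print(evaluate_benchmark(9500, 10000, ("# Not Met", "<=", 1000)))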
Creating a rule set definition
To create a rule set definition, select two or more data rule definitions or data rules and add them to
the rule set. When a rule set is executed, the data is evaluated against the conditions of all rule
definitions and data rules included in the rule set.
A rule set definition allows you to define a series of data rule definitions as one combined rule
definition. After you define your rule set definition, you generate a rule set out of the rule set definition.
When you generate a rule set, you bind all of the variables from your data rule definitions, such as
"first_name" and "column_a," to actual data in your data sources, such as "Fred" or "division_codes."
Your representational rule set definition elements are generated as a rule set that is bound to real data.
Once your rule set is generated, you run the rule set to gather information on the data in your projects.
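Conceptually, running a rule set means evaluating every record against every bound rule and noting which records are flagged by one or more rules. The Python sketch below illustrates that idea only; the rule names, column names, and sample rows are invented, and this is not how Information Analyzer is implemented.

def run_rule_set(rules, rows):
    """Evaluate each row against all rules in the set; a row is flagged
    if it fails one or more rules."""
    flagged = []
    for row in rows:
        failed = [name for name, rule in rules.items() if not rule(row)]
        if failed:
            flagged.append((row, failed))
    return {"records": len(rows), "flagged": len(flagged)}, flagged

rules = {
    "first_name_present": lambda r: bool(r.get("first_name")),
    "division_code_known": lambda r: r.get("division_code") in {"A01", "B02", "C03"},
}
rows = [{"first_name": "Fred", "division_code": "A01"},
        {"first_name": "", "division_code": "Z99"}]
print(run_rule_set(rules, rows))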
Procedure
1. From the Develop Navigator menu in the console, select Data Quality.
2. Click New Rule Set Definition in the Tasks list, located on the right side of the screen.
3. Enter a name for the new rule set definition in the Name field.
4. Optional: Enter a brief description of your rule set definition in the Short Description field.
5. Optional: Enter a longer description of your rule set definition in the Long Description field.
6. Optional: Select any benchmarks that you want to set for the rule set definition. You can set
a Validity Benchmark, Confidence Benchmark, or Baseline Comparison Benchmark.
7. Click Save.
Creating a metric
You can create a metric, which is an equation that you define, in order to develop a measurement you
can apply against data rules, rule sets, and other metrics.
You can create metrics to establish a set of key performance indicators (KPIs) around the data quality of
the sources that are being evaluated. You can use metrics to aggregate the results of multiple rules and
rule sets into higher-level key performance indicators across multiple sources.
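A metric, in this sense, is simply an equation over the results of rules and rule sets. As a hedged sketch of one possible KPI (the rule names, weights, and % Met values are assumptions, not product output), a metric could be a weighted average of several rules' % Met results:

def weighted_quality_kpi(rule_results, weights):
    """Aggregate several rules' % Met results into one KPI.
    rule_results: {"rule_name": percent_met}; weights: {"rule_name": weight}."""
    total_weight = sum(weights.values())
    return sum(rule_results[name] * w for name, w in weights.items()) / total_weight

results = {"customer_rules": 96.5, "order_rules": 88.0, "address_rules": 92.3}
weights = {"customer_rules": 0.5, "order_rules": 0.3, "address_rules": 0.2}
print(round(weighted_quality_kpi(results, weights), 1))   # 93.1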
Procedure
1. From the Develop Navigator menu in the console, select Data Quality.
2. Click New Metric in the Tasks pane.
3. Required: In the Name field, type a name for the metric.
4. Optional: Provide a short and long description.
5. Optional: In the Validity Benchmark section, select Include Benchmark to set benchmarks to
check the validity of your data.
6. To associate the metric with a folder:
a. Select the Folders view.
b. Click Add. You can search for folders by name and select folders to be associated with
the metric.
7. To develop the logic for the new metric, select from a variety of predefined metric combinations
to build logic for the metric:
a. Click the Measures tab.
b. Select an opening parenthesis if you are grouping lines of logic to form a single
condition.
c. Compose a metric expression that can include rules, rule sets, metric executables, and
functions from the Quality Control and Functions tabs on the Expression palette.
d. Select a closing parenthesis if you are grouping lines of logic to form a single condition.
e. Select a Boolean operator.
8. Save the new metric.
Advanced analysis and monitoring
• Enables users to easily classify data, display data using semantics, validate column/table
relationships and move to exception rows for further analysis.
• Provides data quality assessment functions such as column, primary key, foreign key, cross-domain
and baseline analysis, and offers 80 configurable reports for visualizing analysis and trends.
• Uses the IBM Information Server scheduling service to allow scheduled execution of profiling, rules
and metrics.
• Provides auditing, tracking and monitoring of data quality conditions over time to support data
governance initiatives.
• Uses project-, role- and user-based approaches to control access to sensitive information, including
the ability to restrict access to original data sources.
Integrated rules analysis
• Provides common data rules to perform trending and pattern analysis and to establish baselines
consistently across data sources.
• Offers multiple-level rules analysis (by rule, record, pattern) for evaluating data issues by record
rather than in isolation.
• Provides pre-packaged data validation rules to reduce development time.
• Offers exception-based management of business rules and transformations.
Scalable, collaborative platform
• Provides native parallel execution for enterprise scalability to support large volumes of data.
• Supports multiple analytical reviews and asynchronous profiling to allow more than one user to
work in a project-based context.
• Uses virtual tables and columns for analyzing data without requiring changes to a host database.
• Provides annotations to enable users to add their business names, descriptions, business terms and
other attributes to tables, columns and rules.
Support for heterogeneous data
• Uses open database connectivity (ODBC) or native connectivity to profile IBM DB2, IBM Informix,
Oracle, Microsoft SQL Server, Sybase, Microsoft Access, Teradata and other data sources such as
text files.
• Allows reuse and sharing of data rules in IBM InfoSphere DataStage through IBM InfoSphere
QualityStage and InfoSphere Information Analyzer to help you align data quality metrics throughout
the project lifecycle.
• Uses metadata to allow analytical results to be shared across all IBM InfoSphere Information Server
modules.
• Integrates with IBM InfoSphere Metadata Workbench and IBM InfoSphere Business Glossary.
• Integrates with IBM InfoSphere Information Analyzer for Linux on System z®, allowing you to
perform data quality functions directly on the mainframe.