InfoSphere Information Analyzer
Data quality assessment, analysis and monitoring
Information Analyzer is an IBM product widely used for data profiling. It helps in understanding data structure, format, relationships, and quality. Information
Analyzer is also referred to as WebSphere Information Analyzer.
Information Analyzer has extensive data profiling capabilities. Its user interface includes a set of controls designed to integrate with the development workflow. The four major data profiling
functions within Information Analyzer are:
Column Analysis: Generates a full-frequency distribution and examines column values to infer properties
and definitions such as statistical measures and domain values.
Primary Key Analysis: Identifies candidate keys for one or more tables and aids in testing columns or
column combinations to determine whether a candidate is suitable as a primary key.
Foreign Key Analysis: Examines the relationships and contents across tables, thereby identifying foreign
keys and checking referential integrity.
Cross-Domain Analysis: Identifies overlap in values between columns and redundancy of data between
tables.
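These profiling functions run inside the Information Analyzer engine, but the core idea behind column analysis, building a full-frequency distribution and inferring properties from it, can be sketched in plain Python. This is an illustrative approximation only; the property names and sample data below are made up, not Information Analyzer's actual output:

```python
from collections import Counter

def column_analysis(values):
    """Build a frequency distribution for a column and infer basic
    properties, loosely mirroring what a column-analysis step reports."""
    freq = Counter(values)
    non_null = [v for v in values if v is not None]
    return {
        "frequency": dict(freq),                      # full-frequency distribution
        "cardinality": len(freq),                     # number of distinct values
        "completeness": len(non_null) / len(values),  # share of non-null records
        "inferred_type": "numeric" if all(
            isinstance(v, (int, float)) for v in non_null) else "string",
    }

profile = column_analysis(["A", "B", "A", None, "A"])
print(profile["cardinality"], profile["completeness"])  # 3 0.8
```

Cross-domain analysis can be pictured the same way: comparing two such frequency distributions and reporting the overlap in their distinct values.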
IBM InfoSphere Information Analyzer provides data quality assessment, data quality monitoring and
data rule design and analysis capabilities. This software helps you derive more meaning from your
enterprise data, reduces the risk of proliferating incorrect information, facilitates the delivery of trusted
content, and helps to lower data integration costs.
InfoSphere Information Analyzer features include:
• Advanced analysis and monitoring provides source system profiling and analysis capabilities to help
you classify and assess your data.
• Integrated rules analysis uses data quality rules for greater validation, trending and pattern analysis.
• Scalable, collaborative platform enables sharing of information and results across the enterprise.
• Support for heterogeneous data enables you to assess information over a wide range of systems
and data sources.
Methodology and best practices
Use IBM InfoSphere Information Analyzer to understand the content, structure, and overall quality of your
data at a given point in time.
The analysis methodology and best practices provide deeper insight into the analytical methods
employed by IBM InfoSphere Information Analyzer to analyze source data and rules.
The information is organized by analytical function. It gives you both in-depth knowledge and best
practices for:
• Data analysis, including:
o Applying data analysis system functionality
o Applying data analysis techniques within a function
o Interpreting data analysis results
o Making decisions or taking actions based on analytical results
• Data quality analysis and monitoring, including:
o Supporting business-driven rule definition and organization
o Applying and reusing rules consistently across data sources
o Leveraging multi-level rule analysis to understand broader data quality issues
o Evaluating rules against defined benchmarks/thresholds
o Assessing and annotating data quality results
o Monitoring trends in data quality over time
o Deploying rules across environments
o Running rules through ad hoc, scheduled, or command-line execution options
Analyzing data by using data rules
The topics in this section describe how to define and execute data rules, which evaluate or validate
specific conditions associated with your data sources. Data rules can be used to extend your data
profiling analysis, to test and evaluate data quality, or to improve your understanding of data integration
requirements.
To work with data rules, start by going to the Develop navigator menu in the console and selecting
Data Quality. This takes you to the starting point for creating and working with data rule
functionality.
From the Data Quality workspace you can:
• Create data rule definitions, rule set definitions, data rules, rule sets, and metrics
• Build data rule definition, rule set definition, and metric logic
• Create data rule definition and rule set definition associations
• Associate a data rule definition, rule set definition, metric, data rule, or rule set with folders
• Associate a data rule definition, rule set definition, metric, data rule, or rule set with IBM®
InfoSphere™ Business Glossary terms, policies, and contacts
• Build data rule definitions or rule set definitions by using the rule builder
• Add a data rule definition with the free form editor
Characteristics of data rule functionality
You can use data rules to evaluate and analyze conditions found during data profiling, to conduct a data
quality assessment, to provide more information to a data integration effort, or to establish a
framework for validating and measuring data quality over time.
You can construct data rules in a generic fashion through the use of rule definitions. These definitions
describe the rule evaluation or condition. By associating physical data sources to the definition, a data
rule can be run to return analysis statistics and detail results.
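Information Analyzer expresses this separation through its own rule-definition logic and bindings, but the underlying pattern, a reusable logical condition whose variables are bound to physical columns at execution time, can be sketched in Python. The rule, column, and record names here are hypothetical:

```python
def make_rule(name, condition):
    """A rule definition: a named, reusable condition over unbound variables."""
    def run(records, **bindings):
        # Bind logical variable names to physical column names, then evaluate.
        met = sum(1 for r in records
                  if condition({var: r[col] for var, col in bindings.items()}))
        return {"rule": name, "total": len(records), "met": met,
                "not_met": len(records) - met}
    return run

# Rule definition: the logical variable "value" must be present and non-empty.
not_null_rule = make_rule(
    "value_exists",
    lambda v: v["value"] is not None and str(v["value"]).strip() != "")

# Generate a "data rule" by binding "value" to a physical column and running it.
records = [{"cust_name": "Fred"}, {"cust_name": ""}, {"cust_name": "Ann"}]
print(not_null_rule(records, value="cust_name"))
```

The same definition can be bound to any other column, which is what makes rule definitions reusable across data sources.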
Creating a rule definition
Creating a rule definition requires two components: a name for the rule definition and a logical
statement (the rule logic) about what the rule definition tests or evaluates. Incomplete, empty, or
invalid data values affect the quality of the data in your project by interrupting data integration
processes and by using up memory on source systems. You can create rule definitions to analyze data
for completeness and validity to find these anomalies.
You can create a rule definition by defining the name and description of the rule, and by using the free
form editor or rule logic builder to complete the rule logic for the rule definition.
Procedure
1. From the Develop icon in the Navigator menu in the console, select Data Quality.
2. Click New Rule Definition in the Tasks list, located on the right side of the screen.
3. Enter a name for the new rule definition in the Name field.
4. Optional: Enter a brief description of your rule definition in the Short Description field.
5. Optional: Enter a longer description of your rule definition in the Long Description field.
6. Optional: In the Validity Benchmark section, check Include Benchmark to set benchmarks to
check the validity of your data.
7. Click Save.
Generating a data rule from a rule definition
After you create rule definition logic, you can create a data rule to analyze real data in your projects.
Procedure
1. From the Develop icon on the Navigator menu, select Data Quality.
2. Highlight the rule definition that you want to generate a data rule from.
3. In the Tasks menu on the right side of the screen, click Generate Data Rule or Rule Set.
4. On the Overview tab, type a name for the data rule. The name must contain at least one
character and cannot contain the slash (/) character. The name must be unique in your project.
5. Optional: Type a short description and long description of your data rule. The Created
By, Created On, and Last Modified fields are automatically populated after you create and save
your data rule. You can optionally provide information in the Owner and Data Steward fields.
6. Decide whether you would like to set a validity benchmark for your data rule. Benchmarks
quantify the quality of your data and help you monitor it. Select the Monitor Records Flagged by
One or More Rules check box in the Validity Benchmark box if you would like to monitor records
that are flagged by other rules in your project.
7. At the top of the workspace, switch from the Overview tab to the Bindings and Output tab.
8. Click Save to create the data rule.
Setting benchmarks for data rules
You can set a validity benchmark either when you initially create a rule definition, or when you generate
the data rule, in order to quantify the quality of your data, as well as monitor your data.
Validity benchmark
The validity benchmark establishes the level of tolerance you have for exceptions to the data rule. The
benchmark indicates whether enough records met, or failed to meet, the rule to mark a specific
execution of the rule as having passed or failed the benchmark.
Select Monitor records that do not meet one or more rules in the data rule workspace.
You can define the validity benchmark by using the following options that can be found in the menu in
the validity benchmark workspace. Start by selecting one of the following options:
% Not Met
Determines the percentage of records that did not meet the rule logic in the data rule. You can
set the benchmark to display a pass or fail condition when this value is greater than, less than,
or equal to a reference value that you specify. For example, to ensure that the percentage of
records that do not meet a data rule never exceeds 10%, you would set the benchmark to
"% Not Met <= 10."
# Not Met
Determines the number of records that did not meet the rule logic in your data rule. You can set
the benchmark to display a pass or fail condition when this value is greater than, less than, or
equal to a reference value that you specify. For example, to ensure that the number of
records that do not meet a data rule never exceeds 1000, you would set the benchmark to
"# Not Met <= 1000."
% Met
Determines the percentage of records that meet the rule logic in your data rule. You can set the
benchmark to display a pass or fail condition when this value is greater than, less than, or equal
to a reference value that you specify. For example, to ensure that the percentage of records that
meet the data rule never falls below 90%, you would set the benchmark to "% Met >= 90."
# Met
Determines the number of records that meet the rule logic in your data rule. You can set the
benchmark to display a pass or fail condition when this value is greater than, less than, or equal
to a reference value that you specify. For example, to ensure that the number of records that
meet the data rule never falls below 9000, you would set the benchmark to "# Met >= 9000."
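All four options reduce to a threshold comparison on the rule's execution statistics. The sketch below is a hedged approximation of that logic in Python; the function and statistic names are invented for illustration and are not part of any Information Analyzer API:

```python
def evaluate_benchmark(total, not_met, kind, op, reference):
    """Return True (pass) or False (fail) for a validity benchmark.
    kind: one of 'pct_not_met', 'num_not_met', 'pct_met', 'num_met'."""
    met = total - not_met
    stats = {
        "pct_not_met": 100.0 * not_met / total,  # "% Not Met"
        "num_not_met": not_met,                  # "# Not Met"
        "pct_met": 100.0 * met / total,          # "% Met"
        "num_met": met,                          # "# Met"
    }
    value = stats[kind]
    ops = {"<=": value <= reference,
           ">=": value >= reference,
           "==": value == reference}
    return ops[op]

# "% Not Met <= 10": 50 of 1000 records failed the rule (5%), benchmark passes.
print(evaluate_benchmark(1000, 50, "pct_not_met", "<=", 10))   # True
# "# Met >= 9000": 1200 of 10000 failed, so 8800 met; benchmark fails.
print(evaluate_benchmark(10000, 1200, "num_met", ">=", 9000))  # False
```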
Creating a rule set definition
To create a rule set definition, select two or more data rule definitions or data rules and add them to
the rule set. When a rule set is executed, the data will be evaluated based on the conditions of all rule
definitions and data rules included in the rule set.
A rule set definition allows you to define a series of data rule definitions as one combined rule
definition. After you define your rule set definition, you generate a rule set out of the rule set definition.
When you generate a rule set, you bind all of the variables from your data rule definitions, such as
"first_name" and "column_a," to actual data in your data sources, such as "Fred" or "division_codes."
Your representational rule set definition elements are generated as a rule set that is bound to real data.
Once your rule set is generated, you run the rule set to gather information on the data in your projects.
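The record-level behavior of a rule set, where every record is evaluated against every rule and exceptions can be counted per rule and per record, can be sketched as follows. The rules and records are hypothetical examples, not Information Analyzer output:

```python
def run_rule_set(records, rules):
    """Evaluate every record against every rule in the set.
    rules: mapping of rule name -> predicate over a record."""
    per_rule = {name: 0 for name in rules}  # records failing each rule
    per_record = []                         # rules failed by each record
    for rec in records:
        failed = [name for name, pred in rules.items() if not pred(rec)]
        for name in failed:
            per_rule[name] += 1
        per_record.append(failed)
    return per_rule, per_record

rules = {
    "name_exists": lambda r: bool(r.get("name")),
    "age_in_range": lambda r: 0 <= r.get("age", -1) <= 120,
}
records = [{"name": "Fred", "age": 34}, {"name": "", "age": 150}]
per_rule, per_record = run_rule_set(records, rules)
print(per_rule)       # each rule's exception count
print(per_record[1])  # rules the second record failed
```

Viewing failures per record rather than per rule in isolation is what multiple-level rule analysis builds on.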
Procedure
1. From the Develop Navigator menu in the console, select Data Quality.
2. Click New Rule Set Definition in the Tasks list, located on the right side of the screen.
3. Enter a name for the new rule set definition in the Name field.
4. Optional: Enter a brief description of your rule set definition in the Short Description field.
5. Optional: Enter a longer description of your rule set definition in the Long Description field.
6. Optional: Select any benchmarks that you want to set for the rule set definition. You can set
a Validity Benchmark, Confidence Benchmark, or Baseline Comparison Benchmark.
7. Click Save.
Creating a metric
You can create a metric, which is an equation that you define, in order to develop a measurement you
can apply against data rules, rule sets, and other metrics.
You can create metrics to establish a set of key performance indicators (KPI) around the data quality of
the sources that are being evaluated. You can use metrics to aggregate the results of multiple rules and
rule sets to provide you with a higher level of key performance indicators across multiple sources.
Procedure
1. From the Develop Navigator menu in the console, select Data Quality.
2. Click New Metric in the Tasks pane.
3. Required: In the Name field, type a name for the metric.
4. Optional: Provide a short and long description.
5. Optional: In the Validity Benchmark section, select Include Benchmark to set benchmarks to
check the validity of your data.
6. To associate the metric with a folder:
a. Select the Folders view.
b. Click Add. You can search for folders by name and select folders to be associated with
the metric.
7. To develop the logic for the new metric, select from a variety of predefined metric combinations
to build logic for the metric:
a. Click the Measures tab.
b. Select an opening parenthesis if you are grouping lines of logic to form a single
condition.
c. Compose a metric expression that can include rules, rule sets, metric executables, and
functions from the Quality Control and Functions tabs on the Expression palette.
d. Select a closing parenthesis if you are grouping lines of logic to form a single condition.
e. Select a Boolean operator.
8. Save the new metric.
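Since a metric is an equation over the results of rules, rule sets, and other metrics, a simple weighted KPI can convey the idea. This is a hedged sketch; the rule names, percentages, and weights below are invented for illustration:

```python
def quality_metric(results, weights):
    """Aggregate per-rule 'percent met' results into one weighted KPI score.
    results: rule name -> percent of records that met the rule (0-100).
    weights: rule name -> relative importance (normalized here)."""
    total_weight = sum(weights.values())
    return sum(results[name] * w for name, w in weights.items()) / total_weight

# Hypothetical execution results from two data rules and a rule set.
results = {"name_exists": 98.0, "age_in_range": 90.0, "address_rules": 80.0}
weights = {"name_exists": 2, "age_in_range": 1, "address_rules": 1}
print(quality_metric(results, weights))  # -> 91.5
```

A metric like this gives one number that can be tracked over time across multiple sources, which is the KPI use case described above.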
Advanced analysis and monitoring
• Enables users to easily classify data, display data using semantics, validate column/table
relationships and move to exception rows for further analysis.
• Provides data quality assessment functions such as column, primary key, foreign key, cross-domain
and baseline analysis, and offers 80 configurable reports for visualizing analysis and trends.
• Uses the IBM Information Server scheduling service to allow scheduled execution of profiling, rules
and metrics.
• Provides auditing, tracking and monitoring of data quality conditions over time to support data
governance initiatives.
• Uses project-, role- and user-based approaches to control access to sensitive information, including
the ability to restrict access to original data sources.
Integrated rules analysis
• Provides common data rules to perform trending, pattern analysis and establish baselines
consistently over data sources.
• Offers multiple-level rules analysis (by rule, record, pattern) for evaluating data issues by record
rather than in isolation.
• Provides pre-packaged data validation rules to reduce development time.
• Offers exception-based management of business rules and transformations.
Scalable, collaborative platform
• Provides native parallel execution for enterprise scalability to support large volumes of data.
• Supports multiple analytical reviews and asynchronous profiling to allow more than one user to
work in a project-based context.
• Uses virtual tables and columns for analyzing data without requiring changes to a host database.
• Provides annotations to enable users to add their business names, descriptions, business terms and
other attributes to tables, columns and rules.
Support for heterogeneous data
• Uses open database connectivity (ODBC) or native connectivity to profile IBM DB2, IBM Informix,
Oracle, Microsoft SQL Server, Sybase, Microsoft Access, Teradata and other data sources such as
text files.
• Allows reuse and sharing of data rules in IBM InfoSphere DataStage through IBM InfoSphere
QualityStage and InfoSphere Information Analyzer to help you align data quality metrics throughout
the project lifecycle.
• Uses metadata to allow analytical results to be shared across all IBM InfoSphere Information Server
modules.
• Integrates with IBM InfoSphere Metadata Workbench and IBM InfoSphere Business Glossary.
• Integrates with IBM InfoSphere Information Analyzer for Linux on System z®, allowing you to
perform data quality functions directly on the mainframe.