DATA WAREHOUSING AND DATA MINING
UNIT-I
Kishore Kumar M
Introduction to Data Mining :
 Introduction
 What is Data Mining, Definition, KDD, Challenges
 Data Mining Tasks
 Data Preprocessing,
 Data Cleaning, Missing data,
 Dimensionality Reduction,
 Feature Subset Selection,
 Discretization and Binarization,
 Data Transformation;
 Measures of Similarity and Dissimilarity- Basics.
 INTRODUCTION:
 WHAT MOTIVATED DATA MINING? WHY IS IT IMPORTANT?
The major reason that data mining has attracted a great deal of attention in the information
industry in recent years is the wide availability of huge amounts of data and the
imminent need for turning such data into useful information and knowledge. The
information and knowledge gained can be used for applications ranging from business
management, production control, and market analysis, to engineering design and
science exploration.
Data mining can be viewed as a result of the natural evolution of information
technology. An evolutionary path has been witnessed in the database industry in the
development of the following functionalities (Figure 1.1): data collection and database
creation, data management (including data storage and retrieval, and database
transaction processing), and data analysis and understanding (involving data
warehousing and data mining).
Since the 1960s, database and information technology has evolved systematically
from primitive file processing systems to sophisticated and powerful database systems.
Research and development in database systems since the 1970s has led to relational
database systems, data modeling tools, indexing and data organization techniques, and
efficient methods for on-line transaction processing (OLTP).
Database technology since the mid-1980s has been characterized by the popular
adoption of relational technology and an upsurge of research and development activities
on new and powerful database systems. These promoted the development of advanced
data models such as extended-relational, object-oriented, object-relational, and
deductive models, as well as application-oriented database systems, including spatial,
temporal, multimedia, active, and scientific and engineering databases. Heterogeneous database
systems and Internet-based global information systems such as the World-Wide Web
(WWW) also emerged and play a vital role in the information industry.
Data can now be stored in many different types of databases and information
repositories.
Data warehouse technology includes data cleaning, data integration, and On-
Line Analytical Processing (OLAP), that is, analysis techniques with functionalities
such as summarization, consolidation and aggregation, as well as the ability to view
information from different angles.
The abundance of data, coupled with the need for powerful data analysis tools,
has been described as a ―data rich but information poor" situation. The fast-growing,
tremendous amount of data, collected and stored in large and numerous databases, has
far exceeded our human ability for comprehension without powerful tools.
As a result, data collected in large databases become "data tombs" - data archives that
are seldom visited. Consequently, important decisions are often made not on the basis of
the information-rich data stored in databases, but instead rely on users or domain experts
manually entering knowledge into knowledge bases. Unfortunately, this procedure is prone to biases and
errors, and is extremely time-consuming and costly. Data mining tools perform data
analysis and may uncover important data patterns, contributing greatly to business
strategies, knowledge
bases, and scientific and medical research.
Fig. The evolution of database system technology:
- Data collection and database creation (1960s and earlier): primitive file processing
- Database management systems (1970s): relational database systems, data modeling tools, indexing and data organization techniques, query languages and query processing, user interfaces, optimization methods, on-line transaction processing (OLTP)
- Advanced database systems (mid-1980s to present): advanced data models (relational, object-relational, etc.) and advanced applications (spatial, temporal, multimedia, scientific and engineering)
- Advanced data analysis: data warehousing and data mining (1980s to present): data warehouse and OLAP, data mining and knowledge discovery, advanced data mining applications
- Web-based databases (1990s to present): XML-based database systems, integration with information retrieval, data and information integration
- Next generation of integrated data and information systems (present to future)
 WHAT IS DATA MINING?
Data mining refers to extracting or "mining" knowledge from large amounts of data.
The term is actually a misnomer (an unsuitable name). Remember that the mining of gold
from rocks or sand is referred to as gold mining rather than rock or sand mining. Thus,
"data mining" should more appropriately have been named "knowledge mining from
data", which is unfortunately somewhat long. Nevertheless, the shorter term, which carries
both "data" and "mining", became the popular choice. There are many other terms
with a similar or slightly different meaning to data mining, such as knowledge
mining from data, knowledge extraction, data/pattern analysis, data archaeology,
and data dredging.
Another popularly used term is "Knowledge Discovery from Data", or KDD.
Knowledge discovery consists of the following sequence of steps:
1. Data cleaning (to remove noise or irrelevant data),
2. Data integration (where multiple data sources may be combined)
3. Data selection (where data relevant to the analysis task are retrieved from the
database)
4. Data transformation (where data are transformed or consolidated into forms
appropriate for mining by performing summary or aggregation operations)
5. Data mining (an essential process where intelligent methods are applied in order to
extract data patterns),
6. Pattern evaluation (to identify the truly interesting patterns representing knowledge
based on some interestingness measures)
7. Knowledge presentation (where visualization and knowledge representation
techniques are used to present the mined knowledge to the user).
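As a rough illustration of how these steps might be chained in practice, the following Python sketch strings together cleaning/integration, selection, transformation, and a very simple mining step with pandas. The file names, column names, and the 90th-percentile threshold are hypothetical placeholders, not something fixed by the steps above.

    # Minimal sketch of a KDD-style pipeline (hypothetical files and columns).
    import pandas as pd

    # Steps 1-2: data cleaning and integration - load and combine two sources.
    sales = pd.read_csv("sales.csv")          # e.g., cust_id, item, amount
    customers = pd.read_csv("customers.csv")  # e.g., cust_id, age, income
    data = sales.merge(customers, on="cust_id", how="inner")

    # Step 3: data selection - keep only attributes relevant to the task.
    data = data[["cust_id", "age", "income", "amount"]].dropna()

    # Step 4: data transformation - aggregate purchases per customer.
    per_cust = data.groupby(["cust_id", "age", "income"], as_index=False)["amount"].sum()

    # Step 5: a (very) simple "mining" step - flag high-spending customers.
    per_cust["high_spender"] = per_cust["amount"] > per_cust["amount"].quantile(0.9)

    # Steps 6-7: pattern evaluation and presentation - inspect the flagged group.
    print(per_cust[per_cust["high_spender"]].describe())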
 Architecture of typical data mining System:
Data mining is the process of discovering interesting knowledge from large amounts
of data stored either in databases, data warehouses, or other information repositories.
 Database, data warehouse, or other information repository: This is one or a
set of databases, data warehouses, spread sheets, or other kinds of information
repositories. Data cleaning and data integration techniques may be performed on
the data.
 Database or data warehouse server: The database or data warehouse server is
responsible for fetching the relevant data, based on the user's data mining request.
 Knowledge base: This is the domain knowledge that is used to guide the search,
or evaluate the interestingness of resulting patterns. Such knowledge can include
concept hierarchies, used to organize attributes or attribute values into different
levels of abstraction.
 Data mining engine: This is essential to the data mining system and ideally
consists of a set of functional modules for tasks such as characterization,
association analysis, classification, evolution and deviation analysis.
 Pattern evaluation module: This component typically employs interestingness
measures and interacts with the data mining modules so as to focus the search
towards interesting patterns.
 Graphical user interface. This module communicates between users and the
data mining system, allowing the user to interact with the system by specifying a
data mining query or task, providing information to help focus the search, and
performing exploratory data mining based on the intermediate data mining results.
Fig. Architecture of typical data mining System
 Data mining - on what kind of data?
Data mining should be applicable to any kind of information repository. This includes
relational databases, data warehouses, transactional databases, advanced database
systems, flat files, data streams and the World-Wide Web. Advanced database systems
include object-oriented and object-relational databases, and specific application-oriented
databases,such as spatial databases, time-series databases, text databases, and multimedia
databases.
o Relational databases:
A database system, also called a database management system (DBMS), consists of a
collection of interrelated data, known as a database, and a set of software programs to
manage and access the data.
A relational database is a collection of tables, each of which is assigned a unique name.
Each table consists of a set of attributes (columns or fields) and usually stores a large
number of tuples (records or rows). Each tuple in a relational table represents an object
identified by a unique key and described by a set of attribute values.
Consider the following example.
Example: The AllElectronics company is described by the following relation tables:
customer, item, employee.
o The relation customer consists of a set of attributes, including a unique customer
identity number (cust ID),
o Similarly, each of the relations item and employee consists of a set of attributes,
describing their properties.
Fig : Fragments of relations from a relational database for AllElectronics
 Data Warehouses:
A Data Warehouse is a repository of information collected from multiple sources, stored
under a unified schema, and usually residing at a single site. Data warehouses are
constructed via a process of data cleaning, data transformation, data integration, data
loading, and periodic data refreshing.
In order to facilitate decision making, the data in a data warehouse are
organized around major subjects. The data are stored to provide information from a
historical perspective (such as from the past 5-10 years), and are typically summarized.
A data warehouse is usually modeled by a multidimensional database
structure, where each dimension corresponds to an attribute or a set of attributes in the
schema, and each cell stores the value of some aggregate measure. The actual physical
structure of a data warehouse may be a relational data store or a multidimensional data
cube. It provides a multidimensional view of data and allows the pre computation and
fast accessing of summarized data.
Fig : Architecture of a typical data warehouse.
 Transactional Databases:
A transactional database consists of a file where each record represents a
transaction. A transaction typically includes a unique transaction identity number (trans
ID), and a list of the items making up the transaction(such as items purchased in a store).
The transactional database may have additional tables associated with it, which contain
other information regarding the sale, such as the date of the transaction, the customer ID
number, the ID number of the sales person, and of the branch at which the sale occurred,
and so on.
Tab : Fragment of a transactional database for sales at AllElectronics.
Trans_ID    List of item_IDs
T100        I1, I3, I8, I16
T200        I2, I8
...         ...
 Advanced Data and Information Systems and Advanced Applications:
o Object-relational database: An object-relational database (ORD), or object-relational
database management system (ORDBMS), is a database management system
(DBMS) similar to a relational database, but with an object-oriented database model:
objects, classes and inheritance are directly supported in database schemas and in the
query language. In addition, just as with proper relational systems, it supports
extension of the data model with custom data-types and methods.
o Temporal database: A temporal database is a database with built-in support for
handling data involving time, for example a temporal data model and a temporal
version of Structured Query Language (SQL).
o Sequence database: It stores sequences of ordered events, with or without concrete
notion of time. Examples include customer shopping sequences, web click streams,
and biological sequences.
o Time-Series database: A time-series database stores sequences of values obtained over
repeated measurements of time (e.g., daily, weekly, hourly); examples include data
collected from a stock exchange and observations of natural phenomena.
o Spatial database: A spatial database is a database that is optimized to store and query
data that represents objects defined in a geometric space. Most spatial databases
allow representing simple geometric objects such as points, lines and polygons.
Some spatial databases handle more complex structures such as 3D objects,
topological coverages, linear networks, and TINs.
o Spatiotemporal database: A spatiotemporal database is a database that manages both
space and time information. Common examples include:
- Tracking of moving objects, which typically can occupy only a single position at a
given time.
- A database of wireless communication networks, which may exist only for a short
time span within a geographic region.
- Historical tracking of plate tectonic activity.
o Text database: These are databases which contain word descriptions for objects.
These words are not simple keywords rather long sentences or paragraphs, such as
error or bug reports, warning messages, summary reports or other documents.
o Multimedia database: A multimedia database is a database that hosts one or more
primary media file types such as .txt (documents), .jpg (images), .swf (videos), .mp3
(audio), etc. And loosely fall into three main categories:
Static media (time-independent, i.e. images and handwriting)
Dynamic media (time-dependent, i.e. video and sound bytes)
Dimensional media
o Legacy database: It is a group of heterogeneous database that combines different
kind of data systems, such as network databases, spreadsheets, object oriented or file
systems. Heterogeneous databases in legacy database may be connected by inter
computer networks like intranet or internet.
o Heterogeneous database: It consists of a set of inter connected, autonomous
component databases. The components communicate in order to exchange
information and answer queries. Objects in one component database may differ
greatly from objects in other component databases, making it difficult to assimilate
their semantics into the overall heterogeneous database.
o Flat Files: A flat file database describes any of various means to encode a database
model (most commonly a table) as a single file.
A flat file can be a plain text file or a binary file. There are usually no
structural relationships between the records.
o Data streams: A data stream is a sequence of digitally encoded coherent signals
(packets of data or data packets) used to transmit or receive information that is in the
process of being transmitted.
In electronics and computer architecture, a data flow determines for which
time which data item is scheduled to enter or leave which port of a systolic array, a
Reconfigurable Data Path Array or similar pipe network, or other processing unit or
block.
o World Wide Web: The World Wide Web (abbreviated as WWW or commonly
known as the web), is a system of interlinked hypertext documents accessed via the
Internet. With a web browser, one can view web pages that may contain text, images,
videos, and other multimedia, and navigate between them via hyperlinks.
 Data mining functionalities: what kinds of patterns can be mined?
Various types of data stores and database systems on which data mining can be
performed.
Data mining functionalities are used to specify the kind of patterns to be found in data
mining tasks. In general, data mining tasks can be classified into two categories:
descriptive and predictive.
Descriptive mining tasks characterize the general properties of the data in the database.
Predictive mining tasks perform inference on the current data in order to make
predictions.
In some cases, users may have no idea of which kinds of patterns in their data may be
interesting, and hence may like to search for several different kinds of patterns in parallel.
Thus it is important to have a data mining system that can mine multiple kinds of patterns
to accommodate different user expectations or applications. Data mining systems should
also allow users to specify hints to guide or focus the search for interesting patterns.
Data mining functionalities, and the kinds of patterns they can discover, are
described below.
1. CONCEPT/CLASS DESCRIPTION: characterization and discrimination:
Data can be associated with classes or concepts. It can be useful to describe individual
classes and concepts in summarized, concise, and yet precise terms. Such descriptions
of a class or a concept are called class/concept descriptions. These descriptions can be
derived via (1) data characterization, by summarizing the data of the class under study
(often called the target class) in general terms, or (2) data discrimination, by comparison
of the target class with one or a set of comparative classes (often called the contrasting
classes), or (3) both data characterization and discrimination.
Data characterization is a summarization of the general characteristics or features
of a target class of data. The data corresponding to the user-specified class are typically
collected by a database query. For example, to study the characteristics of software
products whose sales increased by 10% in the last year, one can collect the data related to
such products by executing an SQL query.
There are several methods for effective data summarization and
characterization.
An attribute-oriented induction technique can be used to perform data
generalization and characterization without step-by-step user interaction.
The output of data characterization can be presented in various forms.
Examples include pie charts, bar charts, curves, multidimensional data cubes, and
multidimensional tables, including cross tabs. The resulting descriptions can also be
presented as generalized relations, or in rule form (called characteristic rules).
Example : A data mining system should be able to produce a description summarizing
the characteristics of customers who spend more than $1,000 a year at AllElectronics.
The result could be a general profile of the customers such as they are 40-50 years old,
employed, and have excellent credit ratings. The system should allow users to drill-
down on any dimension, such as on ―occupation" in order to view these customers
according to their employment.
Data discrimination is a comparison of the general features of target class data
objects with the general features of objects from one or a set of contrasting classes. The
target and contrasting classes can be specified by the user, and the corresponding data
objects retrieved through data base queries. For example, one may like to compare the
general features of software products whose sales increased by 10% in the last year with
those whose sales decreased by at least 30% during the same period.
The methods used for data discrimination are similar to those used for data
characterization. The forms of output presentation are also similar, although
discrimination descriptions should include comparative measures which help distinguish
between the target and contrasting classes. Discrimination descriptions expressed in rule
form are referred to as discriminant rules. The user should be able to manipulate the
output for characteristic and discriminant descriptions.
Example : A data mining system should be able to compare two groups of AllElectronics
customers, such as those who shop for computer products regularly (more than 4 times a
month) vs. those who rarely shop for such products (i.e., less than three times a year).
The resulting description could be a general, comparative profile of the customers such as
80% of the customers who frequently purchase computer products are between 20-40
years old and have a university education, whereas 60% of the customers who
infrequently buy such products are either old or young, and have no university degree.
Drilling-down on a dimension, such as occupation, or adding new dimensions, such as
income level, may help in finding even more discriminative features between the two
classes.
2. ASSOCIATION ANALYSIS:
Association analysis is the discovery of association rules showing attribute-value
conditions that occur frequently together in a given set of data. Association analysis is
widely used for market basket or transaction data analysis.
Example: For the AllElectronics relational database, a mined association rule could be
buys(X, "computer") => buys(X, "software") [support = 1%, confidence = 50%]
Where X is a variable representing a customer. A confidence, or certainty, of 50%
means that if a customer buys a computer, there is a 50% chance that she will buy
software as well. A 1% support means that 1% of all of the transactions under analysis
showed that computer and software were purchased together. This association rule
involves a single attribute or predicate(i.e. buys), so it is referred to as single-
dimensional association rule.
A data mining system may find association rule like
age(X, ―20..29") ^ income(X, ―20K..30K") => buys(X, ―CD player")
[support = 2%, confidence = 60%]
The rule indicates that of the AllElectronics customers under study, 2% (support) are 20-
29 years of age with an income of 20-30K and have purchased a CD player. There is a
60% probability (confidence, or certainty) that a customer in this age and income group
will purchase a CD player.
Note that this is an association between more than one attribute, or predicate (i.e., age,
income, and buys).Adopting the terminology used in multidimensional databases, where
each attribute is referred to as a dimension, the above rule can be referred to as a
multidimensional association rule.
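To make support and confidence concrete, the short Python sketch below counts them for a rule of the form buys(X, "computer") => buys(X, "software") over a small transaction list; the transactions themselves are invented for illustration.

    # Support and confidence of the rule {computer} => {software}
    # over a small, made-up transactional database.
    transactions = [
        {"computer", "software", "printer"},
        {"computer", "scanner"},
        {"software"},
        {"computer", "software"},
        {"printer"},
    ]

    antecedent, consequent = {"computer"}, {"software"}
    n = len(transactions)
    both = sum(1 for t in transactions if antecedent <= t and consequent <= t)
    ante = sum(1 for t in transactions if antecedent <= t)

    support = both / n        # fraction of all transactions containing both itemsets
    confidence = both / ante  # fraction of antecedent transactions also containing the consequent
    print(f"support = {support:.0%}, confidence = {confidence:.0%}")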
3. CLASSIFICATION AND PREDICTION
Classification is the process of finding a set of models (or functions)
which describe and distinguish data classes or concepts, for the purposes of being able to
use the model to predict the class of objects whose class label is unknown. The derived
model is based on the analysis of a set of training data (i.e., data objects whose class label
is known).
The derived model may be represented in various forms, such as classification
(IF-THEN) rules, decision trees, mathematical formulae, or neural networks.
A decision tree is a flow-chart-like tree structure, where each node denotes a test
on an attribute value, each branch represents an outcome of the test, and tree leaves
represent classes or class distributions. Decision trees can be easily converted to
classification rules.
A neural network is a collection of linear threshold units that can be trained to
distinguish objects of different classes.
Classification can be used for predicting the class label of data objects. However,
in many applications, one may like to predict some missing or unavailable data values
rather than class labels. This is usually the case when the predicted values are numerical
data, and is often specifically referred to as prediction. Although prediction may refer to
both data value prediction and class label prediction, it is usually confined to data value
prediction and thus is distinct from classification. Prediction also encompasses the
identification of distribution trends based on the available data.
Classification and prediction may need to be preceded by relevance analysis which
attempts to identify attributes that do not contribute to the classification or prediction
process. These attributes can then be excluded.
Example: Suppose, as sales manager of AllElectronics, you would like to classify a large
set of items in the store, based on three kinds of responses to a sales campaign: good
response, mild response, and no response.
The resulting classification should maximally distinguish each class from the others,
presenting an organized picture of the data set.
(a)
age(X, "youth") AND income(X, "high") => class(X, "A")
age(X, "youth") AND income(X, "low") => class(X, "B")
age(X, "middle_aged") => class(X, "C")
age(X, "senior") => class(X, "C")
(b) (c)
Figure: A classification model can be represented in various forms, such as (a) IF-THEN
rules, (b) a decision tree, or (c) a neural network.
4. CLUSTER ANALYSIS:
Unlike classification and predication, which analyze class-labeled data objects,
clustering analyzes data objects without consulting a known class label. In general, the
class labels are not present in the training data simply because they are not known to
begin with. Clustering can be used to generate such labels. The objects are clustered or
grouped based on the principle of maximizing the intra class similarity and minimizing
the interclass similarity. That is, clusters of objects are formed so that objects within a
cluster have high similarity in comparison to one another, but are very dissimilar to
objects in other clusters. Each cluster that is formed can be viewed as a class of objects,
from which rules can be derived.
Example : Clustering analysis can be performed on AllElectronics customer data in order
to identify homogeneous subpopulations of customers. These clusters may represent
individual target groups for marketing.
Figure: A 2-D plot of customer data with respect to customer locations in a city,
showing three data clusters. Each cluster `center' is marked with a `+'.
5. OUTLIER ANALYSIS:
A database may contain data objects that do not comply with the general
behavior or model of the data. These data objects are outliers. Most data mining methods
discard outliers as noise or exceptions. The analysis of outlier data is referred to as outlier
mining.
Example: Outlier Analysis may uncover fraudulent usage of credit cards by detecting
purchases of extremely large amounts for a given account number in comparison to
regular charges incurred by the same account.
6. EVOLUTION ANALYSIS
Data evolution analysis describes and models regularities or trends for objects
whose behavior changes over time. Although this may include characterization,
discrimination, association, classification, or clustering of time-related data, distinct
features of such an analysis include time-series data analysis, sequence or periodicity
pattern matching, and similarity-based data analysis.
Example: Suppose that you have the major stock market (time-series) data of the last
several years available from the New York Stock Exchange and you would like to invest
in shares of high-tech industrial companies. A data mining study of stock exchange data
may identify stock evolution regularities for overall stocks and for the stocks of
particular companies.
 A classification of data mining systems:
Data mining is an interdisciplinary field, the confluence of a set of disciplines (as
shown in Figure), including database systems, statistics, machine learning, visualization,
and information science. Moreover, depending on the data mining approach used,
techniques from other disciplines may be applied, such as neural networks, fuzzy and/or
rough set theory, knowledge representation, inductive logic programming, or high
performance computing.
Data mining systems can be categorized according to various criteria, as follows.
 Classification according to the kinds of databases mined.
A data mining system can be classified according to the kinds of databases mined.
Database systems themselves can be classified according to different criteria (such as
data models, or the types of data or applications involved), each of which may require its
own data mining technique.
 Classification according to the kinds of knowledge mined.
Data mining systems can be categorized according to the kinds of knowledge they
mine, i.e., based on data mining functionalities, such as characterization, discrimination,
association, classification, clustering, trend and evolution analysis, deviation analysis,
similarity analysis, etc. A comprehensive data mining system usually provides multiple
and/or integrated data mining functionalities.
 Classification according to the kinds of techniques utilized.
Data mining systems can also be categorized according to the underlying data mining
techniques employed. These techniques can be described according to the degree of user
interaction involved (e.g., autonomous systems, interactive exploratory systems, query-
driven systems), or the methods of data analysis employed (e.g., database-oriented or data
warehouse-oriented techniques, machine learning, statistics, visualization, pattern
recognition, neural networks, and so on).
 Classification according to applications adopted.
Data mining systems can also be categorized according to the applications they are adapted to.
For example, data mining systems may be tailored specifically for finance,
telecommunications, DNA, stock markets, e-mail, and so on. Different applications often
require the integration of application-specific methods. Therefore, a generic, all-purpose
data mining system may not fit domain-specific mining tasks.
Fig: Data mining as a confluence of multiple disciplines.
 Data Mining Task Primitives
A data mining task can be specified in the form of a data mining query, which is input
to the data mining system. A data mining query is defined in terms of data mining task
primitives. These primitives allow the user to interactively communicate with the data
mining system during discovery in order to direct the mining process:
 The set of task-relevant data to be mined: This specifies the portions of the database
or the set of data in which the user is interested. This includes the database attributes
or data warehouse dimensions of interest.
 The kind of knowledge to be mined: This specifies the data mining functions to be
performed, such as characterization, discrimination, association or correlation
analysis, classification, prediction, clustering, outlier analysis, or evolution analysis.
 The background knowledge to be used in the discovery process: This knowledge
about the domain to be mined is useful for guiding the knowledge discovery process
and for evaluating the patterns found. Concept hierarchies are a popular form of
background knowledge, which allow data to be mined at multiple levels of
abstraction.
 The interestingness measures and thresholds for pattern evaluation: They may be
used to guide the mining process or, after discovery, to evaluate the discovered
patterns. Different kinds of knowledge may have different interestingness measures.
For example, interestingness measures for association rules include support and
confidence. Rules whose support and confidence values are below user-specified
thresholds are considered uninteresting.
 The expected representation for visualizing the discovered patterns: This refers to
the form in which discovered patterns are to be displayed,which may include rules,
tables, charts, graphs, decision trees, and cubes.
Figure Primitives for specifying a data mining task
 Integration of a Data Mining System with a Database or DataWarehouse
System:
A critical question in the design of a data mining (DM) system is how to integrate or
couple the DM system with a database (DB) system and/or a data warehouse (DW)
system.
When a DM system works in an environment that requires it to communicate with
other information system components, such as DB and DW systems, possible integration
schemes include no coupling, loose coupling, semi tight coupling, and tight coupling.
No coupling: No coupling means that a DM system will not utilize any function of a DB
or DW system. It may fetch data from a particular source (such as a file system), process
data using some data mining algorithms, and then store the mining results in another file.
Without any coupling of such systems, a DM system will need to use other tools
to extract data, making it difficult to integrate such a system into an information
processing environment. Thus, no coupling represents a poor design.
Loose coupling: Loose coupling means that a DM system will use some facilities of a
DB or DW system, fetching data from a data repository managed by these systems,
performing data mining, and then storing the mining results either in a file or in a
designated place in a database or data warehouse.
Loose coupling is better than no coupling because it can fetch any portion of data
stored in databases or data warehouses by using query processing.
However, because loosely coupled mining systems are main memory-based, it is difficult
for loose coupling to achieve high scalability and good performance with large data sets.
Semitight coupling: Semitight coupling means that besides linking a DM system to a
DB/DW system, efficient implementations of a few essential data mining primitives (like
sorting, indexing, aggregation, histogram analysis, multiway join) can be provided in the
DB/DW system.
Moreover, some frequently used intermediate mining results can be pre-computed
and stored in the DB/DW system. Because these intermediate mining results are either
pre-computed or can be computed efficiently, this design will enhance the performance
of a DM system.
Tight coupling: Tight coupling means that a DM system is smoothly integrated into the
DB/DW system. The data mining subsystem is treated as one functional component of an
information system. Data mining queries and functions are optimized based on mining
query analysis, data structures, indexing schemes, and query processing methods of a DB
or DW system.
This approach is highly desirable because it facilitates efficient implementations
of data mining functions, high system performance, and an integrated information
processing environment.
 Data Preprocessing:
There are a number of data preprocessing techniques. Data cleaning can be
applied to remove noise and correct inconsistencies in the data. Data integration merges
data from multiple sources into a coherent data store, such as a data warehouse or a data
cube. Data transformations, such as normalization, may be applied. For example,
normalization may improve the accuracy and efficiency of mining algorithms involving
distance measurements. Data reduction can reduce the data size by aggregating,
eliminating redundant features, or clustering, for instance. These data processing
techniques, when applied prior to mining, can substantially improve the overall data
mining results.
 Why preprocess the data?
The data you wish to analyze by data mining techniques are often incomplete (lacking
attribute values or certain attributes of interest, or containing only aggregate data), noisy
(containing errors, or outlier values which deviate from the expected), and inconsistent
(e.g., containing discrepancies in the department codes used to categorize items).
Incomplete data can occur for a number of reasons. Attributes of interest may not
always be available, such as customer information for sales transaction data. Other data
may not be included simply because it was not considered important at the time of entry.
Data that were inconsistent with other recorded data may have been deleted.
Data can be noisy, having incorrect attribute values, owing to the following. The
data collection instruments used may be faulty. There may have been human or computer
errors occurring at data entry. Incorrect data may also result from inconsistencies in
naming conventions used. Duplicate tuples also require data cleaning.
Data cleaning routines work to ―clean" the data by filling in missing values,
smoothing noisy data, identifying or removing outliers, and resolving inconsistencies.
Dirty data can cause confusion for the mining procedure.
Figure Forms of data preprocessing
 DATA CLEANING:
Real-world data tend to be incomplete, noisy, and inconsistent. Data cleaning routines
attempt to fill in missing values, smooth out noise while identifying outliers, and correct
inconsistencies in the data.
1. Missing values
Imagine that you need to analyze AllElectronics sales and customer data. You
note that many tuples have no recorded value for several attributes, such as customer
income.
i. Ignore the tuple: This is usually done when the class label is missing (assuming the
mining task involves classification or description). This method is not very effective,
unless the tuple contains several attributes with missing values.
ii. Fill in the missing value manually: In general, this approach is time-consuming and
may not be feasible given a large data set with many missing values.
iii. Use a global constant to fill in the missing value: Replace all missing attribute
values by the same constant, such as a label like "Unknown" or -∞ (minus infinity).
iv. Use the attribute mean to fill in the missing value: For example, suppose that the
average income of AllElectronics customers is 28,000. Use this value to replace the
missing value for income.
v. Use the attribute mean for all samples belonging to the same class as the given
tuple: For example, if classifying customers according to credit risk, replace the
missing value with the average income value for customers in the same credit risk
category as that of the given tuple.
vi. Use the most probable value to fill in the missing value: This may be determined
with inference-based tools using a Bayesian formalism or decision tree induction. For
example, using the other customer attributes in your data set, you may construct a
decision tree to predict the missing values for income.
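Strategies (iii)-(v) above map directly onto common library calls; a minimal pandas sketch (with an invented income column and credit-risk class) is:

    # Illustrating strategies (iii)-(v) for missing values on a toy customer table.
    import pandas as pd

    df = pd.DataFrame({
        "credit_risk": ["low", "low", "high", "high"],
        "income": [28000.0, None, 41000.0, None],
    })

    # (iii) Fill with a global constant.
    filled_const = df["income"].fillna(-1)

    # (iv) Fill with the attribute mean over all tuples.
    filled_mean = df["income"].fillna(df["income"].mean())

    # (v) Fill with the attribute mean of tuples in the same class (credit_risk).
    filled_class_mean = df.groupby("credit_risk")["income"].transform(
        lambda s: s.fillna(s.mean())
    )
    print(filled_const.tolist(), filled_mean.tolist(), filled_class_mean.tolist())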
2. Noisy data: Noise is a random error or variance in a measured variable. Given a
numeric attribute such as, say, price, how can we ―smooth" out the data to remove the
noise? Let's look at the following data smoothing techniques.
i. Binning methods: Binning methods smooth a sorted data value by consulting the
―neighborhood", or values around it. The sorted values are distributed into a number of
'buckets', or bins.
Fig. Binning methods for data smoothing.
Figure illustrates some binning techniques. In this example, the data for price are first
sorted and partitioned into equi-depth bins (of depth 3).In smoothing by bin means, each
value in a bin is replaced by the mean value of the bin. For example, the mean of the
values 4, 8, and 15 in Bin 1 is 9. Therefore, each original value in this bin is replaced by
the value 9. Similarly, smoothing by bin medians can be employed, in which each bin
value is replaced by the bin median. In smoothing by bin boundaries, the minimum and
maximum values in a given bin are identified as the bin boundaries.
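The following sketch reproduces equi-depth binning of depth 3 with smoothing by bin means and by bin boundaries. Only the first bin (4, 8, 15) is given in the text above; the remaining price values are assumed sample data.

    # Equi-depth binning (depth 3) with smoothing by bin means and bin boundaries.
    prices = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])  # bin 1 = (4, 8, 15) as in the text
    depth = 3
    bins = [prices[i:i + depth] for i in range(0, len(prices), depth)]

    # Smoothing by bin means: every value in a bin becomes the bin mean.
    by_means = [[sum(b) / len(b)] * len(b) for b in bins]

    # Smoothing by bin boundaries: every value becomes the closer bin boundary.
    def by_boundaries(b):
        lo, hi = b[0], b[-1]
        return [lo if v - lo <= hi - v else hi for v in b]

    print(by_means)                        # first bin becomes [9.0, 9.0, 9.0]
    print([by_boundaries(b) for b in bins])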
ii. Clustering: Outliers may be detected by clustering, where similar values are organized
into groups or ―clusters". Intuitively, values which fall outside of the set of clusters may
be considered outliers.
Fig: Outliers may be detected by clustering analysis.
iii. Combined computer and human inspection: Outliers may be identified through a
combination of computer and human inspection. In one application, for example, an
information-theoretic measure was used to help identify outlier patterns in a handwritten
character database for classification. The measure's value reflected the ―surprise" content
of the predicted character label with respect to the known label.
iv. Regression: Data can be smoothed by fitting the data to a function, such as with
regression. Linear regression involves finding the ―best" line to fit two variables, so that
one variable can be used to predict the other. Multiple linear regression is an extension of
linear regression, where more than two variables are involved.
For example, if the fitted regression line is y = x + 1, then x = 0 gives y = 0 + 1 = 1.
DATA CLEANING AS A PROCESS: Data cleaning is a process used to 'clean' dirty or
noisy data: filling in missing values, removing outliers, and resolving inconsistencies
wherever dirty data are present.
The first step in data cleaning as a process is discrepancy detection. Discrepancies can be
caused by several factors, including poorly designed data entry forms that have many
optional fields, human error during data entry, deliberate errors, and data decay.
Discrepancies may also arise from system errors, errors made during processing, and
inconsistent data representations.
 Data discrepancy detection
 Use metadata (e.g., domain, range, dependency, distribution)
 Check field overloading
 Check uniqueness rule, consecutive rule and null rule
 Use commercial tools
Metadata (data about the data) can be used to detect discrepancies; for example, knowledge
of each attribute's domain, range, dependencies, and distribution helps identify values that
do not belong.
Field overloading is another source of errors; it occurs when developers squeeze new
attribute definitions into unused portions of already defined attributes.
A unique rule states that each value of the given attribute must be different from all other
values of that attribute.
A consecutive rule states that there can be no missing values between the lowest and highest
values for the attribute, and that all values must also be unique.
A null rule specifies the use of blanks, question marks, special characters, or other strings
that may indicate the null condition.
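A minimal sketch of checking the unique rule and the null rule for one attribute with pandas; the column name and the set of strings treated as null markers are assumptions for illustration.

    # Checking the unique rule and the null rule for a hypothetical cust_id column.
    import pandas as pd

    df = pd.DataFrame({"cust_id": ["C1", "C2", "C2", "?", ""]})

    null_markers = {"?", "", "NA"}  # strings assumed to encode a missing value
    violates_null = df["cust_id"].isin(null_markers)
    violates_unique = df["cust_id"].duplicated(keep=False) & ~violates_null

    print("null-rule hits:", int(violates_null.sum()))            # 2
    print("unique-rule violations:", int(violates_unique.sum()))  # 2 (the duplicated C2)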
Commercial tools :
 Data scrubbing tools: use simple domain knowledge (e.g., postal code, spell-
check) to detect errors and make corrections
 Data auditing tools: by analyzing data to discover rules and relationship to detect
violators (e.g., correlation and clustering to find outliers)
 Data migration tools: allow transformations to be specified
 ETL (Extraction/Transformation/Loading) tools: allow users to specify
transformations through a graphical user interface.
 DATA INTEGRATION
It is likely that your data analysis task will involve data integration, which combines
data from multiple sources into a coherent data store, as in data warehousing. These
sources may include multiple databases, data cubes, or flat files.
There are a number of issues to consider during data integration.
 The methods/ procedures for smooth data integration are
 Metadata
 Correlation analysis
 Data Conflict detection
Schema integration can be an issue: how can equivalent real-world entities from multiple
data sources be matched up? This is referred to as the entity identification
problem. For example, how can the data analyst or the computer be sure that customer_id
in one database, and cust_number in another refer to the same entity? Databases and data
warehouses typically have metadata - that is, data about the data. Such metadata can be
used to help avoid errors in schema integration.
Redundancy is another important issue. An attribute may be redundant if it can
be ―derived" from another table, such as annual revenue. Inconsistencies in attribute or
dimension naming can also cause redundancies in the resulting data set.
Some redundancies can be detected by correlation analysis. For example, given
two attributes, such analysis can measure how strongly one attribute implies the other,
based on the available data. The correlation between attributes A and B can be measured
by
    P(A ^ B) / ( P(A) P(B) )
If the resulting value of Equation is greater than 1, then A and B are positively
correlated. The higher the value, the more each attribute implies the other. Hence, a high
value may indicate that A (or B) may be removed as a redundancy. If the resulting value
is equal to 1, then A and B are independent and there is no correlation between them. If
the resulting value is less than 1, then A and B are negatively correlated. This means that
each attribute discourages the other.
 Correlation Analysis is a statistical technique used to describe not only degree of
relationship b/w the variables but also the direction of co-variation. It deals with
the association between two or more variables.
 For numerical attributes, correlation between two attributes computed by
Correlation coefficient.
Correlation Analysis (Numerical Data):
Correlation coefficient (also called Pearson's product moment coefficient):

    r_A,B = Σ (a_i - mean(A)) (b_i - mean(B)) / ((n - 1) σA σB)
          = (Σ (a_i b_i) - n mean(A) mean(B)) / ((n - 1) σA σB)

where n is the number of tuples, mean(A) and mean(B) are the respective mean values of A
and B, σA and σB are the respective standard deviations of A and B, and Σ(a_i b_i) is the
sum of the AB cross-product.
 If rA,B > 0, A and B are positively correlated (A's values increase as B's do). The
higher the value, the stronger the correlation.
 rA,B = 0: independent; rA,B < 0: negatively correlated
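The following numpy sketch evaluates r_A,B on two invented attribute vectors, using both the cross-product form above (with sample standard deviations) and numpy's built-in correlation as a cross-check.

    # Pearson correlation r_AB between two numeric attributes (toy data).
    import numpy as np

    A = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
    B = np.array([1.0, 3.0, 5.0, 9.0, 12.0])

    n = len(A)
    # Sample standard deviations (ddof=1) to match the (n - 1) in the formula.
    r = (np.sum(A * B) - n * A.mean() * B.mean()) / ((n - 1) * A.std(ddof=1) * B.std(ddof=1))
    print(r, np.corrcoef(A, B)[0, 1])  # the two values agree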
Correlation Analysis (Categorical Data) :
 Χ² (chi-square) test
 The larger the Χ² value, the more likely the variables are related
 The cells that contribute the most to the Χ² value are those whose actual count is
very different from the expected count
 Correlation does not imply causality
 The number of hospitals and the number of car thefts in a city are correlated
 Both are causally linked to a third variable: population
 The expected frequency can be computed as

    e_ij = ( count(A = a_i) x count(B = b_j) ) / N

where N is the number of tuples, count(A = a_i) is the number of tuples having value a_i
for A, and count(B = b_j) is the number of tuples having value b_j for B.
 The Χ² statistic tests the hypothesis that A and B are independent. The test is
based on a significance level, with (r - 1) x (c - 1) degrees of freedom. If the
hypothesis can be rejected, then A and B are statistically related or associated.
Chi-Square Calculation: An Example
                          Male        Female       Sum (row)
Like science fiction      250 (90)    200 (360)    450
Not like science fiction  50 (210)    1000 (840)   1050
Sum (col.)                300         1200         1500
Using the formula above, we compute the expected frequencies:
e11 = (300 * 450 ) / 1500 = 90
e12 = (300 * 1050 ) / 1500 = 210
e21 = (1200 * 450 ) / 1500 = 360
e22 = (1200 * 1050 ) / 1500 = 840
The Χ² (chi-square) statistic is then computed as (the numbers in parentheses in the table
above are the expected counts, calculated from the data distribution in the two categories):

    Χ² = Σ (Observed - Expected)² / Expected

    Χ² = (250 - 90)²/90 + (50 - 210)²/210 + (200 - 360)²/360 + (1000 - 840)²/840
       = 284.44 + 121.90 + 71.11 + 30.48 = 507.93
 It shows that like_science_fiction and gender are correlated in the group.
 In the above contingency table, the degrees of freedom are (2 - 1)(2 - 1) = 1. For 1
degree of freedom, the Χ² value needed to reject the hypothesis at the 0.001
significance level is 10.828.
 Since 507.93 > 10.828, the hypothesis that gender and preferred reading are
independent is rejected; the two attributes are correlated.
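A short Python check of this computation; the observed counts are taken from the contingency table above.

    # Chi-square statistic for the 2x2 contingency table above.
    observed = [[250, 200],   # like science fiction: male, female
                [50, 1000]]   # not like science fiction: male, female

    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    N = sum(row_totals)

    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / N   # expected frequency e_ij
            chi2 += (o - e) ** 2 / e

    print(round(chi2, 1))  # 507.9, matching the 507.93 above (which rounds each term)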
 DATA TRANSFORMATION :
In data transformation, the data are transformed or consolidated into forms
appropriate for mining. Data transformation can involve the following:
1. Smoothing, which works to remove the noise from data. Such techniques include
binning, clustering, and regression.
 Regression analysis estimates or predicts the unknown value of one variable from
the known value of another variable.
 Clustering is a mechanism for detecting outliers.
 Binning smooths a sorted data value by consulting its "neighborhood", i.e., the
values around it. The sorted values are distributed into a number of "buckets" or
"bins"; in smoothing by bin means, for example, each value in a bin is replaced by
the mean of the bin.
(Note: In detail explained above topic: Data cleaning)
2. Aggregation, where summary or aggregation operations are applied to the data. For
example, the daily sales data may be aggregated so as to compute monthly and annual
total amounts. This step is typically used in constructing a data cube for analysis of the
data at multiple granularities.
For example, suppose an organization wants to compute the average annual income of its
employees:

    emp_no    Annual_income
    100       25,00,000
    101       30,00,000
    103        5,00,000

Using an aggregate function such as avg( ) in the database, this summary is easily computed.
3. Generalization of the data, where low level or 'primitive' (raw) data are replaced by
higher level concepts through the use of concept hierarchies. For example, categorical
attributes, like street, can be generalized to higher level concepts, like city or county.
Similarly, values for numeric attributes, like age, may be mapped to higher level
concepts, like young, middle-aged, and senior.
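One simple way to apply such a concept hierarchy to a numeric attribute like age is a small mapping function; the cut-off ranges below are illustrative assumptions, not fixed by the text.

    # Generalizing the numeric attribute "age" to higher-level concepts.
    def generalize_age(age: int) -> str:
        if age < 30:       # assumed boundary for "young"
            return "young"
        elif age < 60:     # assumed boundary for "middle_aged"
            return "middle_aged"
        return "senior"

    print([generalize_age(a) for a in [22, 45, 67]])  # ['young', 'middle_aged', 'senior']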
4. Normalization, where the attribute data are scaled so as to fall within a small specified
range, such as -1.0 to 1.0, or 0 to 1.0.
Normalization keeps attribute values small and easy to compare. For instance, in an
employee table the annual salary values are very large; after normalization they can be
expressed on a smaller scale, such as annual salary (in lakhs).
• There are three common normalization methods:
i. min-max normalization
ii. z-score normalization
iii. normalization by decimal scaling
 Min-max normalization: maps a value v of A to v' in [new_minA, new_maxA].
It is a method which performs a linear transformation on the original data.
Suppose that minA and maxA are the minimum and maximum values of an attribute A.
Min-max normalization maps a value v of A to v' in the range [new_minA, new_maxA]
by computing

    v' = ((v - minA) / (maxA - minA)) * (new_maxA - new_minA) + new_minA

Min-max normalization preserves the relationships among the original data values. It
will encounter an "out of bounds" error if a future input case for normalization falls
outside of the original data range for A.
Ex: Suppose that the minimum and maximum values for the attribute income are
12,000 and 98,000, and that income is to be normalized to the range [0.0, 1.0].
Then 73,600 is mapped to

    ((73,600 - 12,000) / (98,000 - 12,000)) * (1.0 - 0) + 0 = 0.716

 Z-score normalization (or zero-mean normalization): the values for an attribute A are
normalized based on the mean and standard deviation of A. A value v of A is
normalized to v' by computing

    v' = (v - mean(A)) / σA

Ex: Let the mean and standard deviation of income be 54,000 and 16,000. Then 73,600
is normalized to

    (73,600 - 54,000) / 16,000 = 1.225

 Normalization by decimal scaling: normalizes by moving the decimal point of values
of attribute A. It is useful when the actual minimum and maximum of attribute A are
unknown, or when there are outliers which dominate the min-max normalization.

    v' = v / 10^j
Where j is the smallest integer such that Max(|ν‘|) < 1
Ex: Suppose the recorded values of A range from -986 to 917. So the maximum
absolute value of A is 986. To normalize by decimal scaling we divide each value by
1000 (i.e., j = 3).
so that -986 normalizes to -0.986 and 917 to 0.917.
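The three methods can be written as small functions; the sketch below reproduces the worked values 0.716, 1.225, and -0.986 from the examples above.

    # Min-max, z-score, and decimal-scaling normalization, checked against the examples.
    def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
        return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

    def z_score(v, mean_a, std_a):
        return (v - mean_a) / std_a

    def decimal_scaling(v, j):
        return v / 10 ** j

    print(round(min_max(73600, 12000, 98000), 3))             # 0.716
    print(round(z_score(73600, 54000, 16000), 3))             # 1.225
    print(decimal_scaling(-986, 3), decimal_scaling(917, 3))  # -0.986 0.917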
 DATA REDUCTION
 Why data reduction?
 A database/data warehouse may store terabytes of data
 Complex data analysis/mining may take a very long time to run on the
complete data set
 Data reduction
 It is a process that obtains a reduced representation of the data set, one that is
much smaller in volume yet preserves the integrity of the original data (producing
nearly the same analytical results).
 Data reduction strategies
 Data cube aggregation: , where aggregation operations are applied to the data
in the construction of a data cube.
 Attribute subset selection — e.g., remove unimportant attributes
 Dimensionality reduction (Data compression) – where irrelevant, weakly
relevant, or redundant attributes or dimensions may be detected and removed .
e.g. Using Encoding Schemes
 Numerosity reduction: where the data are replaced or estimated by alternative,
smaller data representations such as parametric models (which need store only
the model parameters instead of the actual data), or nonparametric methods
such as clustering, sampling, and the use of histograms. — e.g., fit data into
models
 Discretization and concept hierarchy generation, where raw data values for
attributes are replaced by ranges or higher conceptual levels. Concept
hierarchies allow the mining of data at multiple levels of abstraction, and are a
powerful tool for data mining. --- e.g. It is also a form of data reduction
process but which automatically generates ‗hierarchies‘ for numerical data.
1.Data Cube Aggregation:
 In this strategy, data cubes are constructed by applying aggregation operations to
the data.
 E.g., suppose an organization's sales have been recorded per quarter for the years
1997-1999, but the interest is in annual sales. The data can be aggregated so that
the resulting data summarize the total sales per year instead of per quarter.
 The tables referred to here show the sales quarter-wise; these are aggregated into
year-wise totals.
• Data cubes store multidimensional aggregated information.
Ex: Data cubes can store a multidimensional analysis of sales data.
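A small pandas sketch of the quarterly-to-annual aggregation described above; the sales figures are invented for illustration.

    # Aggregating quarterly sales up to annual totals, as in the example.
    import pandas as pd

    quarterly = pd.DataFrame({
        "year":    [1997, 1997, 1997, 1997, 1998, 1998, 1998, 1998],
        "quarter": ["Q1", "Q2", "Q3", "Q4", "Q1", "Q2", "Q3", "Q4"],
        "sales":   [224, 408, 350, 586, 300, 420, 380, 600],   # invented amounts
    })

    annual = quarterly.groupby("year", as_index=False)["sales"].sum()
    print(annual)   # one row of total sales per year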
2. Attribute Subset Selection :
 Feature selection (i.e., attribute subset selection):
 Attribute subset selection reduces the data size by removing irrelevant or
redundant attributes.
 The main objective of this method is to find a minimum set of attributes such
that the resulting class distribution is as close as possible to that obtained using
all of the original attributes.
 Data sets for analysis may contain many attributes, some of which may be
irrelevant or redundant to the mining task.
 Heuristic methods (due to exponential no of choices):
 Step-wise forward selection
 Step-wise backward elimination
 Combining forward selection and backward elimination
 Decision-tree induction
Step-wise forward selection: The procedure starts with an empty set of attributes. The
best of the original attributes is determined and added to the set. At each subsequent
iteration or step, the best of the remaining original attributes is added to the set.
Step-wise backward elimination: The procedure starts with the full set of attributes. At
each step, it removes the worst attribute remaining in the set.
Combination forward selection and backward elimination: The step-wise forward
selection and backward elimination methods can be combined, where at each step one
selects the best attribute and removes the worst from among the remaining attributes.
Decision tree induction: Decision tree induction constructs a flow-chart-like structure
where each internal (non-leaf) node denotes a test on an attribute, each branch
corresponds to an outcome of the test, and each external (leaf) node denotes a class
prediction.
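A greedy sketch of step-wise forward selection is shown below. The scoring function is deliberately left abstract (any measure of how well a subset predicts the class will do); the toy score used in the usage line is an assumption for illustration only.

    # Greedy step-wise forward selection over a set of candidate attributes.
    def forward_selection(attributes, score, max_size=None):
        selected = []
        remaining = list(attributes)
        while remaining and (max_size is None or len(selected) < max_size):
            # Pick the attribute whose addition gives the best score.
            best = max(remaining, key=lambda a: score(selected + [a]))
            if score(selected + [best]) <= score(selected):
                break  # no remaining attribute improves the subset; stop
            selected.append(best)
            remaining.remove(best)
        return selected

    # Toy usage: the score simply prefers attributes from a hypothetical "useful" set.
    useful = {"age", "income"}
    print(forward_selection(["age", "income", "shoe_size"],
                            score=lambda s: len(set(s) & useful)))   # ['age', 'income']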
3. Data Compression: In data compression, data encoding or transformations are
applied so as to obtain a reduced or ―compressed" representation of the original data. If
the original data can be reconstructed from the compressed data without any loss of
information, the data compression technique used is called lossless. If, instead, we can
reconstruct only an approximation of the original data, then the data compression
technique is called lossy.
There are several well-tuned algorithms for string compression. Although they are
typically lossless, they allow only limited manipulation of the data. In this section, we
instead focus on two popular and effective methods of lossy data compression: wavelet
transforms, and principal components analysis.
 Wavelet transforms:
The discrete wavelet transform (DWT) is a linear signal processing technique
that, when applied to a data vector D, transforms it to a numerically different vector, D',
of wavelet coefficients. The two vectors are of the same length.
The DWT is closely related to the discrete Fourier transform (DFT), a signal
processing technique involving sines and cosines. (DWT achieves lossy compression)
• The general algorithm for a discrete wavelet transform is as follows:
1) Length, L, must be an integer power of 2 (padding with 0‘s, when
necessary)
2) Each transform has 2 functions: The first applies some data smoothing.
The second performs weighted difference.
3) The two functions are applied to pairs of the input data, resulting in two
sets of data of length L/2.
4) The two functions are recursively applied to the sets of data obtained in
the previous loop.
5) A selection of values from the data sets obtained in the above iterations
are designated the wavelet coefficients of the transformed data.
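One level of the transform described above can be written directly as pairwise smoothing (averages) and weighted differences on an input whose length is a power of 2. This is a plain Haar-style sketch of a single level, not a full wavelet library.

    # One level of a Haar-style discrete wavelet transform on a length-2^k vector.
    import numpy as np

    def dwt_one_level(d):
        d = np.asarray(d, dtype=float)
        assert len(d) % 2 == 0, "length must be even (pad with zeros if necessary)"
        pairs = d.reshape(-1, 2)
        smooth = pairs.mean(axis=1)                 # data-smoothing function
        detail = (pairs[:, 0] - pairs[:, 1]) / 2.0  # weighted-difference function
        return smooth, detail                       # each of length L/2

    smooth, detail = dwt_one_level([2, 2, 0, 2, 3, 5, 4, 4])
    print(smooth, detail)
    # Applying dwt_one_level recursively to `smooth` yields the full coefficient set.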
DWT for Image Compression:
Fig: The image is repeatedly split by a low-pass and a high-pass filter, with the low-pass
output decomposed further at each level.
 Dimensionality Reduction: Principal Component Analysis (PCA)
 Principal Components Analysis as a method of data compression.
 The basic procedure as follows:
 Normalize input data: Each attribute falls within the same range
 Compute k orthonormal (unit) vectors, i.e., principal components
 Each input data (vector) is a linear combination of the k principal
component vectors
 The principal components are sorted in order of decreasing ―significance‖
or strength
 Since the components are sorted, the size of the data can be reduced by eliminating the weak components, i.e., those with low variance (using the strongest principal components, it is possible to reconstruct a good approximation of the original data)
 Works for numeric data only
 Used when the number of dimensions is large
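The procedure above can be sketched in a few lines of Python with NumPy. The covariance/eigendecomposition route shown here is one common way to obtain the orthonormal principal components; the random data and the choice of k are purely illustrative.

```python
import numpy as np

# Minimal PCA sketch: project numeric data onto its k strongest components.
def pca_reduce(X, k):
    X_centered = X - X.mean(axis=0)          # normalize/centre each attribute
    cov = np.cov(X_centered, rowvar=False)   # feature covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)   # orthonormal (unit) eigenvectors
    order = np.argsort(eigvals)[::-1]        # sort by decreasing variance
    components = eigvecs[:, order[:k]]       # keep the k strongest components
    return X_centered @ components           # reduced representation

X = np.random.rand(100, 5)                   # illustrative numeric data
print(pca_reduce(X, 2).shape)                # (100, 2)
```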
4. Numerosity Reduction:
―Can we reduce the data volume by choosing alternative, `smaller' forms of data representation?" Techniques of numerosity reduction can indeed be applied for this purpose. These techniques may be parametric or non-parametric.
For parametric methods, a model is used to estimate the data, so that typically only the
data parameters need be stored, instead of the actual data. (Outliers may also be stored).
Log-linear models, which estimate discrete multidimensional probability distributions,
are an example.
Non-parametric methods for storing reduced representations of the data include
histograms, clustering, and sampling.
 Regression and log-linear models can be used to approximate the given data. In
linear regression, the data are modeled to a straight line. For example, a random
variable, Y (called a response variable), can be modeled as a linear function of another
random variable, X (called a predictor variable), with the equation
Y = α + βX;
where the variance of Y is assumed to be constant. The coefficients α and β (called
regression coefficients) specify the Y -intercept and slope of the line, respectively.
These coefficients can be solved for by the method of least squares, which minimizes the error between the actual data values and the estimates given by the line.
Multiple regression is an extension of linear regression allowing a response variable
Y to be modeled as a linear function of a multidimensional feature vector.
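For example, the least-squares estimates of α and β can be computed directly from the data. The small x/y arrays below are made-up illustrative values, not data from these notes.

```python
import numpy as np

# Least-squares fit of Y = alpha + beta * X (illustrative values only).
x = np.array([3.0, 8.0, 9.0, 13.0, 3.0, 6.0, 11.0, 21.0])
y = np.array([30.0, 57.0, 64.0, 72.0, 36.0, 43.0, 59.0, 90.0])

# beta = covariance(x, y) / variance(x); alpha = mean(y) - beta * mean(x)
beta = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
alpha = y.mean() - beta * x.mean()
print(f"Y = {alpha:.2f} + {beta:.2f} X")
```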
Log-linear models approximate discrete multidimensional probability distributions.
The method can be used to estimate the probability of each cell in a base cuboid for a
set of discretized attributes, based on the smaller cuboids making up the data cube
lattice. This allows higher order data cubes to be constructed from lower order ones.
Log-linear models are therefore also useful for data compression (since the smaller
order cuboids together typically occupy less space than the base cuboid) and data
smoothing (since cell estimates in the smaller order cuboids are less subject to
sampling variations than cell estimates in the base cuboid).
 Histograms : Histograms use binning to approximate data distributions and are a
popular form of data reduction. A histogram for an attribute A partitions the data
distribution of A into disjoint subsets, or buckets. The buckets are displayed on a
horizontal axis, while the height (and area) of a bucket typically reflects the average
frequency of the values represented by the bucket. If each bucket represents only a
single attribute-value/frequency pair, the buckets are called singleton buckets. Often,
buckets instead represent continuous ranges for the given attribute.
 Divide data into buckets and store average (sum) for each bucket
 Partitioning rules:
 Equal-width: equal bucket range
 Equal-frequency (or equal-depth)
 V-optimal: the histogram with the least variance (histogram variance is a weighted sum of the original values that each bucket represents)
 MaxDiff: set bucket boundaries between each pair of adjacent values whose differences are among the β–1 largest
A histogram for price where values are aggregated so that each bucket has a uniform
width of $10.
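A quick sketch of such equal-width binning with NumPy; the price values and $10 bucket width are invented for illustration.

```python
import numpy as np

# Equal-width binning of a price attribute into $10 buckets (sample values).
prices = np.array([1, 5, 8, 10, 12, 14, 15, 18, 20, 21, 25, 28, 30])
counts, edges = np.histogram(prices, bins=np.arange(0, 41, 10))
for lo, hi, c in zip(edges[:-1], edges[1:], counts):
    print(f"[{lo:.0f}, {hi:.0f}): {c} values")   # bucket range and its frequency
```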
 Clustering : Clustering techniques consider data tuples as objects. They partition the
objects into groups or clusters, so that objects within a cluster are ―similar" to one
another and ―dissimilar" to objects in other clusters. Similarity is commonly defined in
terms of how ―close" the objects are in space, based on a distance function. The
―quality" of a cluster may be represented by its diameter, the maximum distance
between any two objects in the cluster. Centroid distance is an alternative measure of
cluster quality, and is defined as the average distance of each cluster object from the
cluster centroid (denoting the ―average object", or average point in space for the
cluster).
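The two quality measures just described (diameter and centroid distance) can be computed directly. In the sketch below, random 2-D points simply stand in for the objects of one cluster.

```python
import numpy as np
from itertools import combinations

# Illustrative cluster of 20 random 2-D objects.
cluster = np.random.rand(20, 2)

# Diameter: maximum distance between any two objects in the cluster.
diameter = max(np.linalg.norm(a - b) for a, b in combinations(cluster, 2))

# Centroid distance: average distance of each object from the cluster centroid.
centroid = cluster.mean(axis=0)
centroid_distance = np.mean([np.linalg.norm(p - centroid) for p in cluster])

print(diameter, centroid_distance)
```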
 Sampling :
Sampling can be used as a data reduction technique since it allows a large data set to
be represented by a much smaller random sample (or subset) of the data. Suppose that
a large data set, D, contains N tuples. Let's have a look at some possible samples for
D.
Sampling: with or without Replacement
1. Simple random sample without replacement (SRSWOR) of size n: This is created
by drawing n of the N tuples from D (n < N), where the probability of drawing any tuple in D is 1/N, i.e., all tuples are equally likely.
2. Simple random sample with replacement (SRSWR) of size n: This is similar to
SRSWOR, except that each time a tuple is drawn from D, it is recorded and then
replaced. That is, after a tuple is drawn, it is placed back in D so that it may be drawn
again.
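A minimal sketch of these two sampling schemes in Python; the tuple IDs and sample size are illustrative.

```python
import random

# Pretend these are the IDs of the N = 100 tuples in D.
D = list(range(1, 101))
n = 10

srswor = random.sample(D, n)                    # SRSWOR: no tuple drawn twice
srswr = [random.choice(D) for _ in range(n)]    # SRSWR: tuples may repeat
print(srswor)
print(srswr)
```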
3. Cluster sample: If the tuples in D are grouped into M mutually disjoint ―clusters",
then a SRS of m clusters can be obtained, where m < M. For example, tuples in a
database are usually retrieved a page at a time, so that each page can be considered a
cluster. A reduced data representation can be obtained by applying, say, SRSWOR to the
pages, resulting in a cluster sample of the tuples.
4. Stratified sample: If D is divided into mutually disjoint parts called ―strata", a
stratified sample of D is generated by obtaining a SRS at each stratum. This helps to
ensure a representative sample, especially when the data are skewed. For example, a
stratified sample may be obtained from customer data, where a stratum is created for each
customer age group. In this way, the age group having the smallest number of customers
will be sure to be represented.
5. Discretization and concept hierarchy generation :
It is difficult and tedious to specify concept hierarchies for numeric attributes due to the
wide diversity of possible data ranges and the frequent updates of data values.
Concept hierarchies for numeric attributes can be constructed automatically based on data
distribution analysis.
 Three types of attributes:
 Nominal — values from an unordered set, e.g., color, profession
 Ordinal — values from an ordered set, e.g., military or academic rank
 Continuous — numeric values, e.g., integer or real numbers
 Discretization:
 Divide the range of a continuous attribute into intervals
 Some classification algorithms only accept categorical attributes.
 Reduce data size by discretization
 Prepare for further analysis
We examine five methods for numeric concept hierarchy generation: binning, histogram analysis, clustering analysis, entropy-based discretization, and segmentation by natural partitioning.
(i) Binning.
These methods are also forms of discretization. For example, attribute values can be
discretized by replacing each bin value by the bin mean or median, as in smoothing by
bin means or smoothing by bin medians, respectively. These techniques can be applied
recursively to the resulting partitions in order to generate concept hierarchies.
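For instance, here is a small sketch of equal-frequency binning followed by smoothing by bin means; the short price list is an invented sample of the kind typically used for such examples.

```python
import numpy as np

# Equal-frequency binning followed by smoothing by bin means (sample prices).
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = np.array_split(prices, 3)                      # three equal-frequency bins
smoothed = [round(float(np.mean(b)), 1) for b in bins for _ in b]
print(smoothed)   # each value replaced by the mean of its bin
```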
(ii) Histogram analysis. Histograms, (as already discussed ), can also be used for
discretization. Figure 3.13 presents a histogram showing the data distribution of the
attribute price for a given data set. For example, the most frequent price range is roughly
$300-$325. Partitioning rules can be used to define the ranges of values. For instance, in
an equi-width histogram, the values are partitioned into equal sized partitions or ranges
(e.g., ($0-$100], ($100-$200], . . , ($900-$1,000]). With an equi-depth histogram, the
values are partitioned so that, ideally, each partition contains the same number of data
samples. The histogram analysis algorithm can be applied recursively to each partition in
order to automatically generate a multilevel concept hierarchy, with the procedure
terminating once a pre-specified number of concept levels has been reached. A minimum
interval size can also be used per level to control the recursive procedure. This specifies
the minimum width of a partition, or the minimum number of values for each partition at
each level.
(iii) Clustering analysis. A clustering algorithm can be applied to partition data into
clusters or groups. Each cluster forms a node of a concept hierarchy, where all nodes are
at the same conceptual level. Each cluster may be further decomposed into several
subclusters, forming a lower level of the hierarchy. Clusters may also be grouped
together in order to form a higher conceptual level of the hierarchy.
(iv) Entropy-based discretization. An information-based measure called ―entropy" can
be used to recursively partition the values of a numeric attribute A, resulting in a
hierarchical discretization. Such a discretization forms a numerical concept hierarchy for
the attribute. Given a set of data tuples, S, the basic method for entropy-based
discretization of A is as follows.
 Each value of A can be considered a potential interval boundary or threshold T.
For example, a value v of A can partition the samples in S into two subsets
satisfying the conditions A < v and A >= v, respectively, thereby creating a binary
discretization.
 Given S, the threshold value selected is the one that maximizes the information
gain resulting from the subsequent partitioning.
The expected information requirement after partitioning S on T is
I(S, T) = (|S1| / |S|) Entropy(S1) + (|S2| / |S|) Entropy(S2)
where S1 and S2 are the subsets of S on either side of T; maximizing the information gain is equivalent to minimizing I(S, T).
 Entropy is calculated based on the class distribution of the samples in a set. Given m classes, the entropy of S1 is
Entropy(S1) = - Σ (i = 1 to m) pi log2(pi)
where pi is the probability of class i in S1
 The boundary that minimizes the entropy function over all possible boundaries is
selected as a binary discretization
 The process is recursively applied to partitions obtained until some stopping
criterion is met
Such a boundary may reduce data size and improve classification accuracy.
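The selection of a single boundary can be sketched as follows. The tiny xs/ys data set is an invented illustration in which one threshold cleanly separates the two classes.

```python
import math
from collections import Counter

# Sketch of choosing one split point for entropy-based discretization:
# pick the boundary T on attribute values `xs` (with class labels `ys`)
# that minimizes the weighted entropy I(S, T) of the two resulting subsets.
def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_split(xs, ys):
    pairs = list(zip(xs, ys))
    best_t, best_i = None, float("inf")
    for t in sorted(set(xs)):
        left = [y for x, y in pairs if x < t]
        right = [y for x, y in pairs if x >= t]
        if not left or not right:            # skip boundaries that split nothing
            continue
        i_st = (len(left) / len(pairs)) * entropy(left) \
             + (len(right) / len(pairs)) * entropy(right)
        if i_st < best_i:
            best_t, best_i = t, i_st
    return best_t, best_i

# Invented toy data: one boundary separates the two classes cleanly.
xs = [1, 2, 3, 10, 11, 12]
ys = ["a", "a", "a", "b", "b", "b"]
print(best_split(xs, ys))   # -> (10, 0.0)
```

The same procedure is then applied recursively to each resulting partition until the stopping criterion is met.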
(v) Segmentation by Natural Partitioning
Although binning, histogram analysis, clustering and entropy-based discretization are
useful in the generation of numerical hierarchies, many users would like to see
numerical ranges partitioned into relatively uniform, easy-to-read intervals that
appear intuitive or natural". For example, annual salaries broken into ranges like
[$50,000, $60,000) are often more desirable than ranges like [$51263.98, $60872.34),
obtained by some sophisticated clustering analysis.
 A simply 3-4-5 rule can be used to segment numeric data into relatively uniform,
―natural‖ intervals.
 If an interval covers 3, 6, 7 or 9 distinct values at the most significant
digit, partition the range into 3 equi-width intervals
 If it covers 2, 4, or 8 distinct values at the most significant digit, partition
the range into 4 intervals
 If it covers 1, 5, or 10 distinct values at the most significant digit, partition
the range into 5 intervals
The results of applying the 3-4-5 rule are shown in Figure 3.14.
 Step 1: Based on the above information, the minimum and maximum values are: MIN = -$351,976.00 and MAX = $4,700,896.50. The low (5%-tile) and high (95%-tile) values to be considered for the top or first level of segmentation are: LOW = -$159,876 and HIGH = $1,838,761.
 Step 2: Given LOW and HIGH, the most significant digit is at the million dollar digit position (i.e., msd = 1,000,000). Rounding LOW down to the million dollar digit, we get LOW‘ = -$1,000,000; and rounding HIGH up to the million dollar digit, we get HIGH‘ = +$2,000,000.
 Step 3: Since this interval ranges over 3 distinct values at the most significant digit, i.e., (2,000,000 - (-1,000,000)) / 1,000,000 = 3, the segment is partitioned into 3 equi-width subsegments according to the 3-4-5 rule: (-$1,000,000, $0], ($0, $1,000,000], and ($1,000,000, $2,000,000]. This represents the top tier of the hierarchy.
 Step 4: We now examine the MIN and MAX values to see how they ―fit" into
the first level partitions.
 Step 5: Recursively, each interval can be further partitioned according to the
3-4-5 rule to form the next lower level of the hierarchy:
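A rough Python sketch of the top-level segmentation in Steps 1-3 is shown below. It is an assumption-laden simplification: it handles only the case LOW < HIGH with a positive range, and it follows the interval counts exactly as listed above.

```python
import math

# Rough sketch of the 3-4-5 rule's top-level segmentation.
def three_four_five(low, high):
    msd = 10 ** int(math.floor(math.log10(high - low)))   # most significant digit position
    low_r = math.floor(low / msd) * msd                    # round LOW down to msd
    high_r = math.ceil(high / msd) * msd                   # round HIGH up to msd
    distinct = round((high_r - low_r) / msd)               # distinct values at msd
    if distinct in (3, 6, 7, 9):
        parts = 3
    elif distinct in (2, 4, 8):
        parts = 4
    else:                                                  # 1, 5 or 10 distinct values
        parts = 5
    width = (high_r - low_r) / parts
    return [(low_r + i * width, low_r + (i + 1) * width) for i in range(parts)]

# The LOW/HIGH values from the worked example above.
print(three_four_five(-159876, 1838761))
# -> three equi-width segments: (-1,000,000, 0], (0, 1,000,000], (1,000,000, 2,000,000]
```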
Concept Hierarchy Generation for Categorical Data :
 Specification of a partial/total ordering of attributes explicitly at the schema level
by users or experts
street < city < state < country
 Specification of a hierarchy for a set of values by explicit data grouping
{Urbana, Champaign, Chicago} < Illinois
 Specification of only a partial set of attributes
E.g., only street < city, not others
 Automatic generation of hierarchies (or attribute levels) by the analysis of the number
of distinct values
E.g., for a set of attributes: {street, city, state, country}
Automatic Concept Hierarchy Generation:
 Some hierarchies can be automatically generated based on the analysis of the
number of distinct values per attribute in the data set
 The attribute with the most distinct values is placed at the lowest level of
the hierarchy
 Exceptions, e.g., weekday, month, quarter, year
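A small sketch of this distinct-value heuristic on made-up location records: the attribute with the most distinct values is placed at the bottom of the hierarchy, the one with the fewest at the top.

```python
# Automatic hierarchy ordering by distinct-value counts (illustrative records).
records = [
    {"street": "5th Ave", "city": "New York", "state": "NY", "country": "USA"},
    {"street": "Main St", "city": "Urbana",   "state": "IL", "country": "USA"},
    {"street": "Oak St",  "city": "Chicago",  "state": "IL", "country": "USA"},
    {"street": "Elm St",  "city": "Chicago",  "state": "IL", "country": "USA"},
]

counts = {a: len({r[a] for r in records}) for a in records[0]}   # distinct values per attribute
hierarchy = sorted(counts, key=counts.get)                       # fewest distinct values on top
print(" < ".join(reversed(hierarchy)))                           # street < city < state < country
```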
Summary :
 Data preparation is an important issue for both data warehousing and data mining, as
real-world data tends to be incomplete, noisy, and inconsistent. Data preparation includes
data cleaning, data integration, data transformation, and data reduction.
 Data cleaning routines can be used to fill in missing values, smooth noisy data, identify
outliers, and correct data inconsistencies.
 Data integration combines data from multiples sources to form a coherent data store.
Metadata, correlation analysis, data conflict detection, and the resolution of semantic
heterogeneity contribute towards smooth data integration.
 Data transformation routines convert the data into forms appropriate for mining. For example, attribute data may be normalized so as to fall within a small range, such as 0.0 to 1.0.
 Data reduction techniques such as data cube aggregation, dimension reduction, data
compression, numerosity reduction, and discretization can be used to obtain a reduced
representation of the data, while minimizing the loss of information content.
 Concept hierarchies organize the values of attributes or dimensions into gradual levels of abstraction. They are a form of discretization that is particularly useful in multilevel mining.
 Automatic generation of concept hierarchies for categorical data may be based on the
number of distinct values of the attributes defining the hierarchy. For numeric data,
techniques such as data segmentation by partition rules, histogram analysis, and
clustering analysis can be used.
 Although several methods of data preparation have been developed, data preparation
remains an active area of research.

Weitere ähnliche Inhalte

Was ist angesagt?

data mining
data miningdata mining
data mininguoitc
 
Data Mining: an Introduction
Data Mining: an IntroductionData Mining: an Introduction
Data Mining: an IntroductionAli Abbasi
 
Data mining , Knowledge Discovery Process, Classification
Data mining , Knowledge Discovery Process, ClassificationData mining , Knowledge Discovery Process, Classification
Data mining , Knowledge Discovery Process, ClassificationDr. Abdul Ahad Abro
 
What Is DATA MINING(INTRODUCTION)
What Is DATA MINING(INTRODUCTION)What Is DATA MINING(INTRODUCTION)
What Is DATA MINING(INTRODUCTION)Pratik Tambekar
 
Introduction to-data-mining chapter 1
Introduction to-data-mining  chapter 1Introduction to-data-mining  chapter 1
Introduction to-data-mining chapter 1Mahmoud Alfarra
 
data mining and data warehousing
data mining and data warehousingdata mining and data warehousing
data mining and data warehousingSunny Gandhi
 
Data miningppt378
Data miningppt378Data miningppt378
Data miningppt378nitttin
 
Data Mining: Concepts and Techniques (3rd ed.) — Chapter _04 olap
Data Mining:  Concepts and Techniques (3rd ed.)— Chapter _04 olapData Mining:  Concepts and Techniques (3rd ed.)— Chapter _04 olap
Data Mining: Concepts and Techniques (3rd ed.) — Chapter _04 olapSalah Amean
 
Data Mining & Data Warehousing Lecture Notes
Data Mining & Data Warehousing Lecture NotesData Mining & Data Warehousing Lecture Notes
Data Mining & Data Warehousing Lecture NotesFellowBuddy.com
 
Introduction To Data Mining
Introduction To Data Mining   Introduction To Data Mining
Introduction To Data Mining Phi Jack
 
Data mining (lecture 1 & 2) conecpts and techniques
Data mining (lecture 1 & 2) conecpts and techniquesData mining (lecture 1 & 2) conecpts and techniques
Data mining (lecture 1 & 2) conecpts and techniquesSaif Ullah
 
Data mining concepts and work
Data mining concepts and workData mining concepts and work
Data mining concepts and workAmr Abd El Latief
 

Was ist angesagt? (20)

Data mining
Data miningData mining
Data mining
 
Data Mining Overview
Data Mining OverviewData Mining Overview
Data Mining Overview
 
data mining
data miningdata mining
data mining
 
Unit 3 part i Data mining
Unit 3 part i Data miningUnit 3 part i Data mining
Unit 3 part i Data mining
 
Data Mining: an Introduction
Data Mining: an IntroductionData Mining: an Introduction
Data Mining: an Introduction
 
Data mining , Knowledge Discovery Process, Classification
Data mining , Knowledge Discovery Process, ClassificationData mining , Knowledge Discovery Process, Classification
Data mining , Knowledge Discovery Process, Classification
 
What Is DATA MINING(INTRODUCTION)
What Is DATA MINING(INTRODUCTION)What Is DATA MINING(INTRODUCTION)
What Is DATA MINING(INTRODUCTION)
 
Introduction to-data-mining chapter 1
Introduction to-data-mining  chapter 1Introduction to-data-mining  chapter 1
Introduction to-data-mining chapter 1
 
data mining and data warehousing
data mining and data warehousingdata mining and data warehousing
data mining and data warehousing
 
Data mining
Data miningData mining
Data mining
 
Data miningppt378
Data miningppt378Data miningppt378
Data miningppt378
 
Data Mining: Concepts and Techniques (3rd ed.) — Chapter _04 olap
Data Mining:  Concepts and Techniques (3rd ed.)— Chapter _04 olapData Mining:  Concepts and Techniques (3rd ed.)— Chapter _04 olap
Data Mining: Concepts and Techniques (3rd ed.) — Chapter _04 olap
 
Data mining notes
Data mining notesData mining notes
Data mining notes
 
3 Data Mining Tasks
3  Data Mining Tasks3  Data Mining Tasks
3 Data Mining Tasks
 
Data Mining & Data Warehousing Lecture Notes
Data Mining & Data Warehousing Lecture NotesData Mining & Data Warehousing Lecture Notes
Data Mining & Data Warehousing Lecture Notes
 
Data mining
Data miningData mining
Data mining
 
Introduction To Data Mining
Introduction To Data Mining   Introduction To Data Mining
Introduction To Data Mining
 
Data mining (lecture 1 & 2) conecpts and techniques
Data mining (lecture 1 & 2) conecpts and techniquesData mining (lecture 1 & 2) conecpts and techniques
Data mining (lecture 1 & 2) conecpts and techniques
 
Data mining
Data miningData mining
Data mining
 
Data mining concepts and work
Data mining concepts and workData mining concepts and work
Data mining concepts and work
 

Ähnlich wie Dm unit i r16

Dwdmunit1 a
Dwdmunit1 aDwdmunit1 a
Dwdmunit1 abhagathk
 
20IT501_DWDM_PPT_Unit_II.ppt
20IT501_DWDM_PPT_Unit_II.ppt20IT501_DWDM_PPT_Unit_II.ppt
20IT501_DWDM_PPT_Unit_II.pptPalaniKumarR2
 
Dwdm unit 1-2016-Data ingarehousing
Dwdm unit 1-2016-Data ingarehousingDwdm unit 1-2016-Data ingarehousing
Dwdm unit 1-2016-Data ingarehousingDhilsath Fathima
 
Lect 1 introduction
Lect 1 introductionLect 1 introduction
Lect 1 introductionhktripathy
 
Data Mining mod1 ppt.pdf bca sixth semester notes
Data Mining mod1 ppt.pdf bca sixth semester notesData Mining mod1 ppt.pdf bca sixth semester notes
Data Mining mod1 ppt.pdf bca sixth semester notesasnaparveen414
 
20IT501_DWDM_PPT_Unit_II.ppt
20IT501_DWDM_PPT_Unit_II.ppt20IT501_DWDM_PPT_Unit_II.ppt
20IT501_DWDM_PPT_Unit_II.pptSamPrem3
 
Data Mining @ BSU Malolos 2019
Data Mining @ BSU Malolos 2019Data Mining @ BSU Malolos 2019
Data Mining @ BSU Malolos 2019Edwin S. Garcia
 
01Introduction to data mining chapter 1.ppt
01Introduction to data mining chapter 1.ppt01Introduction to data mining chapter 1.ppt
01Introduction to data mining chapter 1.pptadmsoyadm4
 
Upstate CSCI 525 Data Mining Chapter 1
Upstate CSCI 525 Data Mining Chapter 1Upstate CSCI 525 Data Mining Chapter 1
Upstate CSCI 525 Data Mining Chapter 1DanWooster1
 

Ähnlich wie Dm unit i r16 (20)

Dma unit 1
Dma unit   1Dma unit   1
Dma unit 1
 
Dwdmunit1 a
Dwdmunit1 aDwdmunit1 a
Dwdmunit1 a
 
20IT501_DWDM_PPT_Unit_II.ppt
20IT501_DWDM_PPT_Unit_II.ppt20IT501_DWDM_PPT_Unit_II.ppt
20IT501_DWDM_PPT_Unit_II.ppt
 
Chapter 1. Introduction.ppt
Chapter 1. Introduction.pptChapter 1. Introduction.ppt
Chapter 1. Introduction.ppt
 
Introduction to data warehouse
Introduction to data warehouseIntroduction to data warehouse
Introduction to data warehouse
 
Dwdm unit 1-2016-Data ingarehousing
Dwdm unit 1-2016-Data ingarehousingDwdm unit 1-2016-Data ingarehousing
Dwdm unit 1-2016-Data ingarehousing
 
Lect 1 introduction
Lect 1 introductionLect 1 introduction
Lect 1 introduction
 
Data Mining mod1 ppt.pdf bca sixth semester notes
Data Mining mod1 ppt.pdf bca sixth semester notesData Mining mod1 ppt.pdf bca sixth semester notes
Data Mining mod1 ppt.pdf bca sixth semester notes
 
dwdm unit 1.ppt
dwdm unit 1.pptdwdm unit 1.ppt
dwdm unit 1.ppt
 
20IT501_DWDM_PPT_Unit_II.ppt
20IT501_DWDM_PPT_Unit_II.ppt20IT501_DWDM_PPT_Unit_II.ppt
20IT501_DWDM_PPT_Unit_II.ppt
 
U - 2 Emerging.pptx
U - 2 Emerging.pptxU - 2 Emerging.pptx
U - 2 Emerging.pptx
 
Data Mining @ BSU Malolos 2019
Data Mining @ BSU Malolos 2019Data Mining @ BSU Malolos 2019
Data Mining @ BSU Malolos 2019
 
Data Mining Intro
Data Mining IntroData Mining Intro
Data Mining Intro
 
data mining
data miningdata mining
data mining
 
01Intro.ppt
01Intro.ppt01Intro.ppt
01Intro.ppt
 
01Introduction to data mining chapter 1.ppt
01Introduction to data mining chapter 1.ppt01Introduction to data mining chapter 1.ppt
01Introduction to data mining chapter 1.ppt
 
01Intro.ppt
01Intro.ppt01Intro.ppt
01Intro.ppt
 
2 Data-mining process
2   Data-mining process2   Data-mining process
2 Data-mining process
 
AIS 3 - EDITED.pdf
AIS 3 - EDITED.pdfAIS 3 - EDITED.pdf
AIS 3 - EDITED.pdf
 
Upstate CSCI 525 Data Mining Chapter 1
Upstate CSCI 525 Data Mining Chapter 1Upstate CSCI 525 Data Mining Chapter 1
Upstate CSCI 525 Data Mining Chapter 1
 

Kürzlich hochgeladen

Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeThiyagu K
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityGeoBlogs
 
Presiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsPresiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsanshu789521
 
The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13Steve Thomason
 
Sanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfSanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfsanyamsingh5019
 
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...Marc Dusseiller Dusjagr
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)eniolaolutunde
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxiammrhaywood
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactdawncurless
 
Science 7 - LAND and SEA BREEZE and its Characteristics
Science 7 - LAND and SEA BREEZE and its CharacteristicsScience 7 - LAND and SEA BREEZE and its Characteristics
Science 7 - LAND and SEA BREEZE and its CharacteristicsKarinaGenton
 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Krashi Coaching
 
Solving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxSolving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxOH TEIK BIN
 
Hybridoma Technology ( Production , Purification , and Application )
Hybridoma Technology  ( Production , Purification , and Application  ) Hybridoma Technology  ( Production , Purification , and Application  )
Hybridoma Technology ( Production , Purification , and Application ) Sakshi Ghasle
 
APM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAPM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAssociation for Project Management
 
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptxContemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptxRoyAbrique
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxheathfieldcps1
 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxNirmalaLoungPoorunde1
 
Arihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfArihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfchloefrazer622
 

Kürzlich hochgeladen (20)

Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and Mode
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activity
 
Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1
 
Presiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsPresiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha elections
 
The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13
 
Sanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfSanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdf
 
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
 
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdfTataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impact
 
Science 7 - LAND and SEA BREEZE and its Characteristics
Science 7 - LAND and SEA BREEZE and its CharacteristicsScience 7 - LAND and SEA BREEZE and its Characteristics
Science 7 - LAND and SEA BREEZE and its Characteristics
 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
 
Solving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxSolving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptx
 
Hybridoma Technology ( Production , Purification , and Application )
Hybridoma Technology  ( Production , Purification , and Application  ) Hybridoma Technology  ( Production , Purification , and Application  )
Hybridoma Technology ( Production , Purification , and Application )
 
APM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAPM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across Sectors
 
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptxContemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptx
 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptx
 
Arihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfArihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdf
 

Dm unit i r16

  • 1. DATA WAREHOUSING AND DATA MINING UNIT-I Kishore Kumar M
  • 2. UNIT - I kishore.mamidala@gmail.com 2 Introduction to Data Mining :  Introduction  What is Data Mining, Definition, KDD, Challenges  Data Mining Tasks  Data Preprocessing,  Data Cleaning, Missing data,  Dimensionality Reduction,  Feature Subset Selection,  Discretization and Binaryzation,  Data Transformation;  Measures of Similarity and Dissimilarity- Basics.  INTRODUCTION:  WHAT MOTIVATED DATA MINING? WHY IS IT IMPORTANT? The major reason that data mining has attracted a great deal of attention in information industry in recent years is due to the wide availability of huge amounts of data and the imminent need for turning such data into useful information and knowledge. The information and knowledge gained can be used for applications ranging from business management, production control, and market analysis, to engineering design and science exploration. Data mining can be viewed as a result of the natural evolution of information technology. An evolutionary path has been witnessed in the database industry in the development of the following functionalities (Figure 1.1): data collection and database creation, data management (including data storage and retrieval, and database transaction processing), and data analysis and understanding (involving data warehousing and data mining). Since the 1960's, database and information technology has been evolving systematically from primitive file processing systems to sophisticated and powerful databases systems. The research and development in database systems since 1970‘s has led to the development of relational database systems structures, data modeling tools, and indexing and data organization techniques. In addition efficient methods for on-line transaction processing (OLTP). Database technology since the mid-1980s has been characterized by the popular adoption of relational technology and an upsurge of research and development activities on new and powerful database systems. These promote the development of advanced data models such as extended-relational, object-oriented, object-relational, and deductive models Application oriented database systems, including spatial, temporal, multimedia, active, and scientific and engineering databases. Heterogeneous database
  • 3. UNIT - I kishore.mamidala@gmail.com 3 systems and Internet-based global information systems such as the World-Wide Web (WWW) also emerged and play a vital role in the information industry. Data can now be stored in many different types of databases and information repositories. Data warehouse technology includes data cleaning, data integration, and On- Line Analytical Processing (OLAP), that is, analysis techniques with functionalities such as summarization, consolidation and aggregation, as well as the ability to view information from different angles. The abundance of data, coupled with the need for powerful data analysis tools, has been described as a ―data rich but information poor" situation. The fast-growing, tremendous amount of data, collected and stored in large and numerous databases, has far exceeded our human ability for comprehension without powerful tools. As a result, data collected in large databases become ―data tombs" Consequently, important decisions are often made based not on the information-rich data stored in databases but typically rely on users or domain experts to manually input knowledge into knowledge bases. Unfortunately, this procedure is prone to biases and errors, and is extremely time-consuming and costly. Data mining tools perform data analysis and may uncover important data patterns, contributing greatly to business strategies, knowledge bases, and scientific and medical research.
  • 4. UNIT - I kishore.mamidala@gmail.com 4 Fig. The evolution of database system technology - Database management systems (1970‘s) - data modeling tools - indexing and data organization techniques - query languages and query processing - user interfaces - optimization methods - on-line transactional processing (OLTP) Data collection and database creation (1960‘s and earlier) - primitive file processing Advanced databases systems (mid 1980‘s-present) Advanced data maodels:relational,object relational,,etc. Advanced applications: spatial,temporal,multimedia,scien tific and engineering. Advanced Data Analysis: Data warehousing & Data mining (1080‘s – present) Data warehouse and OLAP Data Mining and knowledge discovery Advanced Data mining applications Web-based databases (1990‘s- present) XML based database systems Integration with information retrieval Data & information integration Next Generation of Integrated Data & Information systems (present-future)
  • 5. UNIT - I kishore.mamidala@gmail.com 5  WHAT IS DATA MINING? Data mining refers to extracting or ―mining" knowledge from large amounts of data. The term is actually a misnomer (unsuitable name). Remember that the mining of gold from rocks or sand is referred to as gold mining rather than rock or sand mining. Thus, ―data mining" should have been more appropriately named ―knowledge mining from data", which is unfortunately somewhat long. Thus, such a misnomer which carries both ‖data" and ―mining" became a popular choice. There are many other terms carrying a similar or slightly different meaning to data mining, such as knowledge mining from data, knowledge extraction, data/pattern analysis, data archaeology, and data dredging. Another popularly used term, “Knowledge Discovery from Data” or KDD. Knowledge discovery consistes of sequence of the following steps: 1. Data cleaning (to remove noise or irrelevant data), 2. Data integration (where multiple data sources may be combined) 3. Data selection (where data relevant to the analysis task are retrieved from the database) 4. Data transformation (where data are transformed or consolidated into forms appropriate for mining by performing summary or aggregation operations) 5. Data mining (an essential process where intelligent methods are applied in order to extract data patterns), 6. Pattern evaluation (to identify the truly interesting patterns representing knowledge based on some interestingness measures) 7. Knowledge presentation (where visualization and knowledge representation techniques are used to present the mined knowledge to the user).
  • 6. UNIT - I kishore.mamidala@gmail.com 6  Architecture of typical data mining System: Data mining is the process of discovering interesting knowledge from large amounts of data stored either in databases, data warehouses, or other information repositories.  Database, data warehouse, or other information repository: This is one or a set of databases, data warehouses, spread sheets, or other kinds of information repositories. Data cleaning and data integration techniques may be performed on the data.  Database or data warehouse server: The database or data warehouse server is responsible for fetching the relevant data, based on the user's data mining request.  Knowledge base: This is the domain knowledge that is used to guide the search, or evaluate the interestingness of resulting patterns. Such knowledge can include concept hierarchies, used to organize attributes or attribute values into different levels of abstraction.  Data mining engine: This is essential to the data mining system and ideally consists of a set of functional modules for tasks such as characterization, association analysis, classification, evolution and deviation analysis.  Pattern evaluation module: This component typically employs interestingness measures and interacts with the data mining modules so as to focus the search towards interesting patterns.  Graphical user interface. This module communicates between users and the data mining system, allowing the user to interact with the system by specifying a data mining query or task, providing information to help focus the search, and performing exploratory data mining based on the intermediate data mining results. Fig. Architecture of typical data mining System
  • 7. UNIT - I kishore.mamidala@gmail.com 7  Data mining -on what kind of data?. Data mining should be applicable to any kind of information repository. This includes relational databases, data warehouses, transactional databases, advanced database systems, flat files, data streams and the World-Wide Web. Advanced database systems include object-oriented and object-relational databases, and specific application-oriented databases,such as spatial databases, time-series databases, text databases, and multimedia databases. o Relational databases: A database system, also called a database management system (DBMS), consists of a collection of interrelated data, known as a database, and a set of software programs to manage and access the data. A relational database is a collection of tables, each of which is assigned a unique name. Each table consists of a set of attributes (columns or fields) and usually stores a large number of tuples (records or rows). Each tuple in a relational table represents an object identified by a unique key and described by a set of attribute values. Consider the following example. Example: The AllElectronics company is described by the following relation tables: customer, item, employee. o The relation customer consists of a set of attributes, including a unique customer identity number (cust ID), o Similarly, each of the relations item and employee consists of a set of attributes, describing their properties. Fig : Fragments of relations from a relational database for AllElectronics
  • 8. UNIT - I kishore.mamidala@gmail.com 8  Data Warehouses: A Data Warehouse is a repository of information collected from multiple sources,stored under a unified schema, and that usually resides at a single site.Data warehouses are constructed via a process of data cleaning, data transformation, data integration, data loading, and periodic data refreshing. In order to facilitate decision making, the data in a data warehouse are organized around major subjects. The data are stored to provide information from a historical perspective (such as from the past 5-10 years), and are typically summarized. A data warehouse is usually modeled by a multidimensional database structure, where each dimension corresponds to an attribute or a set of attributes in the schema, and each cell stores the value of some aggregate measure. The actual physical structure of a data warehouse may be a relational data store or a multidimensional data cube. It provides a multidimensional view of data and allows the pre computation and fast accessing of summarized data. Fig : Architecture of a typical data warehouse.  Transactional Databases: A transactional database consists of a file where each record represents a transaction. A transaction typically includes a unique transaction identity number (trans ID), and a list of the items making up the transaction(such as items purchased in a store). The transactional database may have additional tables associated with it, which contain other information regarding the sale, such as the date of the transaction, the customer ID number, the ID number of the sales person, and of the branch at which the sale occurred, and so on. Tab : Fragment of a transactional database for sales at AllElectronics. Trans_ID List of item_IDs T100 11,13,18,116 T200 12,18 . . . . . .
  • 9. UNIT - I kishore.mamidala@gmail.com 9  Advanced Data and Information Systems and Advanced Applications: o Object-relational database: An object-relational database (ORD), or object-relational database management system (ORDBMS), is a database management system (DBMS) similar to a relational database, but with an object-oriented database model: objects, classes and inheritance are directly supported in database schemas and in the query language. In addition, just as with proper relational systems, it supports extension of the data model with custom data-types and methods. o Temporal database: A temporal database is a database with built-in support for handling data involving time, for example a temporal data model and a temporal version of Structured Query Language (SQL). o Sequence database: It stores sequences of ordered events, with or without concrete notion of time. Examples include customer shopping sequences, web click streams, and biological sequences. o Time-Series database: It is a software system that is optimized for handling a time series. (eg. daily, weekly, hourly) includes data collected from stock exchange, observation of natural phenomena. o Spatial database: A spatial database is a database that is optimized to store and query data that represents objects defined in a geometric space. Most spatial databases allow representing simple geometric objects such as points, lines and polygons. Some spatial databases handle more complex structures such as 3D objects, topological coverages, linear networks, and TINs. o Spatiotemporal database: A spatiotemporal database is a database that manages both space and time information. Common examples include: - Tracking of moving objects, which typically can occupy only a single position at a given time. - A database of wireless communication networks, which may exist only for a short time span within a geographic region. - Historical tracking of plate tectonic activity. o Text database: These are databases which contain word descriptions for objects. These words are not simple keywords rather long sentences or paragraphs, such as error or bug reports, warning messages, summary reports or other documents. o Multimedia database: A multimedia database is a database that hosts one or more primary media file types such as .txt (documents), .jpg (images), .swf (videos), .mp3 (audio), etc. And loosely fall into three main categories:
  • 10. UNIT - I kishore.mamidala@gmail.com 10 Static media (time-independent, i.e. images and handwriting) Dynamic media (time-dependent, i.e. video and sound bytes) Dimensional media o Legacy database: It is a group of heterogeneous database that combines different kind of data systems, such as network databases, spreadsheets, object oriented or file systems. Heterogeneous databases in legacy database may be connected by inter computer networks like intranet or internet. o Heterogeneous database: It consists of a set of inter connected, autonomous component databases. The components communicate in order to exchange information and answer queries. Objects in one component database may differ greatly from objects in other component databases, making it difficult to assimilate their semantics into the overall heterogeneous database. o Flat Files: A flat file database describes any of various means to encode a database model (most commonly a table) as a single file. A flat file can be a plain text file or a binary file. There are usually no structural relationships between the records. o Data streams: A data stream is a sequence of digitally encoded coherent signals (packets of data or data packets) used to transmit or receive information that is in the process of being transmitted. In electronics and computer architecture, a data flow determines for which time which data item is scheduled to enter or leave which port of a systolic array, a Reconfigurable Data Path Array or similar pipe network, or other processing unit or block. o World Wide Web: The World Wide Web (abbreviated as WWW or commonly known as the web), is a system of interlinked hypertext documents accessed via the Internet. With a web browser, one can view web pages that may contain text, images, videos, and other multimedia, and navigate between them via hyperlinks.  Data mining functionalities |: what kinds of patterns can be mined? Various types of data stores and database systems on which data mining can be performed. Data mining functionalities are used to specify the kind of patterns to be found in data mining tasks. In general, data mining tasks can be classified into two categories: descriptive and predictive. Descriptive mining tasks characterize the general properties of the data in the database. Predictive mining tasks perform inference on the current data in order to make predictions.
  • 11. UNIT - I kishore.mamidala@gmail.com 11 In some cases, users may have no idea of which kinds of patterns in their data may be interesting, and hence may like to search for several different kinds of patterns in parallel. Thus it is important to have a data mining system that can mine multiple kinds of patterns to accommodate different user expectations or applications. Data mining systems should also allow users to specify hints to guide or focus the search for interesting patterns. Data mining functionalities, and the kinds of patterns they can discover, are described below. 1. CONCEPT/CLASS DESCRIPTION: characterization and discrimination: Data can be associated with classes or concepts. It can be useful to describe individual classes and concepts in summarized, concise, and yet precise terms. Such descriptions of a class or a concept are called class/concept descriptions. These descriptions can be derived via (1) data characterization, by summarizing the data of the class under study (often called the target class) in general terms, or (2) data discrimination, by comparison of the target class with one or a set of comparative classes (often called the contrasting classes), or (3) both data characterization and discrimination. Data characterization is a summarization of the general characteristics or features of a target class of data. The data corresponding to the user-specified class are typically collected by a database query. For example, to study the characteristics of software products whose sales increased by 10% in the last year, one can collect the data related to such products by executing an SQL query. There are several methods for effective data summarization and characterization. An attribute-oriented induction technique can be used to perform data generalization and characterization without step-by-step user interaction. The output of data characterization can be presented in various forms. Examples include pie charts, bar charts, curves, multidimensional data cubes, and multidimensional tables, including cross tabs. The resulting descriptions can also be presented as generalized relations, or in rule form (called characteristic rules). Example : A data mining system should be able to produce a description summarizing the characteristics of customers who spend more than : 1000 a year at AllElectronics. The result could be a general profile of the customers such as they are 40-50 years old, employed, and have excellent credit ratings. The system should allow users to drill- down on any dimension, such as on ―occupation" in order to view these customers according to their employment. Data discrimination is a comparison of the general features of target class data objects with the general features of objects from one or a set of contrasting classes. The target and contrasting classes can be specified by the user, and the corresponding data objects retrieved through data base queries. For example, one may like to compare the
  • 12. UNIT - I kishore.mamidala@gmail.com 12 general features of software products whose sales increased by 10% in the last year with those whose sales decreased by at least 30% during the same period. The methods used for data discrimination are similar to those used for data characterization. The forms of output presentation are also similar, although discrimination descriptions should include comparative measures which help distinguish between the target and contrasting classes. Discrimination descriptions expressed in rule form are referred to as discriminant rules. The user should be able to manipulate the output for characteristic and discriminant descriptions. Example : A data mining system should be able to compare two groups of AllElectronics customers, such as those who shop for computer products regularly (more than 4 times a month) vs. those who rarely shop for such products (i.e., less than three times a year). The resulting description could be a general, comparative profile of the customers such as 80% of the customers who frequently purchase computer products are between 20-40 years old and have a university education, whereas 60% of the customers who infrequently buy such products are either old or young, and have no university degree. Drilling-down on a dimension, such as occupation, or adding new dimensions, such as income level, may help in finding even more discriminative features between the two classes. 2. ASSOCIATION ANALYSIS: Association analysis is the discovery of association rules showing attribute-value conditions that occur frequently together in a given set of data. Association analysis is widely used for market basket or transaction data analysis. Example : Given the All Electronics relational database, is Buys(X, ―computer‖) => buys(X, ―software‖) [support = 1%,confidence = 50%] Where X is a variable representing a customer. A confidence, or certainty, of 50% means that if a customer buys a computer, there is a 50% chance that she will buy software as well. A 1% support means that 1% of all of the transactions under analysis showed that computer and software were purchased together. This association rule involves a single attribute or predicate(i.e. buys), so it is referred to as single- dimensional association rule. A data mining system may find association rule like age(X, ―20..29") ^ income(X, ―20K..30K") => buys(X, ―CD player") [support = 2%, confidence = 60%] The rule indicates that of the AllElectronics customers under study, 2% (support) are 20- 29 years of age with an income of 20-30K and have purchased a CD player. There is a 60% probability (confidence, or certainty) that a customer in this age and income group will purchase a CD player.
  • 13. UNIT - I kishore.mamidala@gmail.com 13 Note that this is an association between more than one attribute, or predicate (i.e., age, income, and buys).Adopting the terminology used in multidimensional databases, where each attribute is referred to as a dimension, the above rule can be referred to as a multidimensional association rule. 3. CLASSIFICATION AND PREDICTION Classification is the processing of finding a set of models (or functions) which describe and distinguish data classes or concepts, for the purposes of being able to use the model to predict the class of objects whose class label is unknown. The derived model is based on the analysis of a set of training data (i.e., data objects whose class label is known). The derived model may be represented in various forms, such as classification (IF-THEN) rules, decision trees, mathematical formulae, or neural networks. A decision tree is a flow-chart-like tree structure, where each node denotes a test on an attribute value, each branch represents an outcome of the test, and tree leaves represent classes or class distributions. Decision trees can be easily converted to classification rules. A neural network is a collection of linear threshold units that can be trained to distinguish objects of different classes. Classification can be used for predicting the class label of data objects. However, in many applications, one may like to predict some missing or unavailable data values rather than class labels. This is usually the case when the predicted values are numerical data, and is often specifically referred to as prediction. Although prediction may refer to both data value prediction and class label prediction, it is usually conned to data value prediction and thus is distinct from classification. Prediction also encompasses the identification of distribution trends based on the available data. Classification and prediction may need to be preceded by relevance analysis which attempts to identify attributes that do not contribute to the classification or prediction process. These attributes can then be excluded. Example: Suppose, as sales manager of AllElectronics, you would like to classify a large set of items in the store, based on three kinds of responses to a sales campaign: good response, mild response, and no response. The resulting classification should maximally distinguish each class from the others, presenting an organized picture of the data set. (a) age(X, ―youth‖) AND income(X, ―high‖)  class(X, ―A‖) age(X, ―youth‖) AND income(X, ―low‖)  class(X, ―B‖) age(X, ―middle_aged‖)  class(X, ―C‖) age(X, ―senior‖)  class(X, ―C‖)
  • 14. UNIT - I kishore.mamidala@gmail.com 14 (b) (c) Figure: A classification model can e represented in various forms,such as (a) IF- THEN rules, (b) a decision tree,or (c) neural network. 4. CLUSTER ANALYSIS: Unlike classification and predication, which analyze class-labeled data objects, clustering analyzes data objects without consulting a known class label. In general, the class labels are not present in the training data simply because they are not known to begin with. Clustering can be used to generate such labels. The objects are clustered or grouped based on the principle of maximizing the intra class similarity and minimizing the interclass similarity. That is, clusters of objects are formed so that objects within a cluster have high similarity in comparison to one another, but are very dissimilar to objects in other clusters. Each cluster that is formed can be viewed as a class of objects, from which rules can be derived. Example : Clustering analysis can be performed on AllElectronics customer data in order to identify homogeneous subpopulations of customers. These clusters may represent individual target groups for marketing. Figure: A 2-D plot of customer data with respect to customer locations in a city, showing three data clusters. Each cluster `center' is marked with a `+'. Age? income Class C Class A Class B F1 F2 F3 F4 F5 F8 F7 F6 income age Middle_aged youth low high
5. OUTLIER ANALYSIS
A database may contain data objects that do not comply with the general behavior or model of the data. These data objects are outliers. Most data mining methods discard outliers as noise or exceptions. The analysis of outlier data is referred to as outlier mining.
Example: Outlier analysis may uncover fraudulent usage of credit cards by detecting purchases of extremely large amounts for a given account number in comparison to regular charges incurred by the same account.

6. EVOLUTION ANALYSIS
Data evolution analysis describes and models regularities or trends for objects whose behavior changes over time. Although this may include characterization, discrimination, association, classification, or clustering of time-related data, distinct features of such an analysis include time-series data analysis, sequence or periodicity pattern matching, and similarity-based data analysis.
Example: Suppose that you have the major stock market (time-series) data of the last several years available from the New York Stock Exchange and you would like to invest in shares of high-tech industrial companies. A data mining study of stock exchange data may identify stock evolution regularities for overall stocks and for the stocks of particular companies.

A classification of data mining systems:
Data mining is an interdisciplinary field, the confluence of a set of disciplines (as shown in the figure), including database systems, statistics, machine learning, visualization, and information science. Moreover, depending on the data mining approach used, techniques from other disciplines may be applied, such as neural networks, fuzzy and/or rough set theory, knowledge representation, inductive logic programming, or high performance computing. Data mining systems can be categorized according to various criteria, as follows.
- Classification according to the kinds of databases mined. A data mining system can be classified according to the kinds of databases mined. Database systems themselves can be classified according to different criteria (such as data models, or the types of data or applications involved), each of which may require its own data mining technique.
- Classification according to the kinds of knowledge mined. Data mining systems can be categorized according to the kinds of knowledge they mine, i.e., based on data mining functionalities, such as characterization, discrimination, association, classification, clustering, trend and evolution analysis, deviation analysis,
similarity analysis, etc. A comprehensive data mining system usually provides multiple and/or integrated data mining functionalities.
- Classification according to the kinds of techniques utilized. Data mining systems can also be categorized according to the underlying data mining techniques employed. These techniques can be described according to the degree of user interaction involved (e.g., autonomous systems, interactive exploratory systems, query-driven systems), or the methods of data analysis employed (e.g., database-oriented or data warehouse-oriented techniques, machine learning, statistics, visualization, pattern recognition, neural networks, and so on).
- Classification according to the applications adopted. Data mining systems can also be categorized according to the applications they adopt. For example, data mining systems may be tailored specifically for finance, telecommunications, DNA, stock markets, e-mail, and so on. Different applications often require the integration of application-specific methods. Therefore, a generic, all-purpose data mining system may not fit domain-specific mining tasks.
Fig: Data mining as a confluence of multiple disciplines.

Data Mining Task Primitives
A data mining task can be specified in the form of a data mining query, which is input to the data mining system. A data mining query is defined in terms of data mining task primitives. These primitives allow the user to interactively communicate with the data mining system during discovery in order to direct the mining process.
- The set of task-relevant data to be mined: This specifies the portions of the database or the set of data in which the user is interested. This includes the database attributes or data warehouse dimensions of interest.
- The kind of knowledge to be mined: This specifies the data mining functions to be performed, such as characterization, discrimination, association or correlation analysis, classification, prediction, clustering, outlier analysis, or evolution analysis.
  • 17. UNIT - I kishore.mamidala@gmail.com 17  The background knowledge to be used in the discovery process: This knowledge about the domain to be mined is useful for guiding the knowledge discovery process and for evaluating the patterns found. Concept hierarchies are a popular form of background knowledge, which allow data to be mined at multiple levels of abstraction.  The interestingness measures and thresholds for pattern evaluation: They may be used to guide the mining process or, after discovery, to evaluate the discovered patterns. Different kinds of knowledge may have different interestingness measures. For example, interestingness measures for association rules include support and confidence. Rules whose support and confidence values are below user-specified thresholds are considered uninteresting.  The expected representation for visualizing the discovered patterns: This refers to the form in which discovered patterns are to be displayed,which may include rules, tables, charts, graphs, decision trees, and cubes. Figure Primitives for specifying a data mining task
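The five primitives above can be thought of as the fields of a mining query. The snippet below is only an illustrative way to hold such a specification in code; the dictionary keys and the example values are invented for this sketch and do not correspond to any standard query language or API.

```python
# Hypothetical representation of a data mining task built from the
# five primitives described above (all names are illustrative only).
mining_task = {
    "task_relevant_data": {
        "database": "AllElectronics",           # assumed database name
        "tables": ["sales"],
        "attributes": ["age", "income", "buys"],
    },
    "kind_of_knowledge": "association",         # e.g. characterization, classification, ...
    "background_knowledge": {
        "concept_hierarchy": ["street", "city", "state", "country"],
    },
    "interestingness_measures": {"min_support": 0.05, "min_confidence": 0.7},
    "presentation": ["rules", "tables"],        # desired output forms
}

print(mining_task["kind_of_knowledge"], mining_task["interestingness_measures"])
```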
Integration of a Data Mining System with a Database or Data Warehouse System:
A critical question in the design of a data mining (DM) system is how to integrate or couple the DM system with a database (DB) system and/or a data warehouse (DW) system. When a DM system works in an environment that requires it to communicate with other information system components, such as DB and DW systems, possible integration schemes include no coupling, loose coupling, semitight coupling, and tight coupling.
No coupling: No coupling means that a DM system will not utilize any function of a DB or DW system. It may fetch data from a particular source (such as a file system), process data using some data mining algorithms, and then store the mining results in another file. Without any coupling of such systems, a DM system will need to use other tools to extract data, making it difficult to integrate such a system into an information processing environment. Thus, no coupling represents a poor design.
Loose coupling: Loose coupling means that a DM system will use some facilities of a DB or DW system, fetching data from a data repository managed by these systems, performing data mining, and then storing the mining results either in a file or in a designated place in a database or data warehouse. Loose coupling is better than no coupling because it can fetch any portion of data stored in databases or data warehouses by using query processing. However, because loosely coupled mining systems are main memory-based, it is difficult for loose coupling to achieve high scalability and good performance with large data sets.
Semitight coupling: Semitight coupling means that besides linking a DM system to a DB/DW system, efficient implementations of a few essential data mining primitives (like sorting, indexing, aggregation, histogram analysis, multiway join) can be provided in the DB/DW system. Moreover, some frequently used intermediate mining results can be pre-computed and stored in the DB/DW system. Because these intermediate mining results are either pre-computed or can be computed efficiently, this design will enhance the performance of a DM system.
Tight coupling: Tight coupling means that a DM system is smoothly integrated into the DB/DW system. The data mining subsystem is treated as one functional component of an information system. Data mining queries and functions are optimized based on mining query analysis, data structures, indexing schemes, and query processing methods of a DB or DW system. This approach is highly desirable because it facilitates efficient implementations of data mining functions, high system performance, and an integrated information processing environment.
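As a rough sketch of the loose-coupling scheme described above: the mining program uses the database system only to fetch the task-relevant data via a query, and the analysis itself runs outside the DBMS. The database file, table, and column names below are assumptions invented for illustration.

```python
# Loose coupling sketch: fetch task-relevant data with SQL, mine it in Python.
import sqlite3
from statistics import mean

conn = sqlite3.connect("allelectronics.db")      # assumed database file
rows = conn.execute(
    "SELECT age, income FROM customers WHERE income IS NOT NULL"  # assumed schema
).fetchall()
conn.close()

# The "mining" step happens entirely in the client program (main memory),
# which is why loose coupling struggles to scale to very large data sets.
ages = [r[0] for r in rows]
incomes = [r[1] for r in rows]
print("avg age:", mean(ages), "avg income:", mean(incomes))
```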
Data Preprocessing:
There are a number of data preprocessing techniques. Data cleaning can be applied to remove noise and correct inconsistencies in the data. Data integration merges data from multiple sources into a coherent data store, such as a data warehouse or a data cube. Data transformations, such as normalization, may be applied. For example, normalization may improve the accuracy and efficiency of mining algorithms involving distance measurements. Data reduction can reduce the data size by aggregating, eliminating redundant features, or clustering, for instance. These data processing techniques, when applied prior to mining, can substantially improve the overall data mining results.

Why preprocess the data?
The data you wish to analyze by data mining techniques are incomplete (lacking attribute values or certain attributes of interest, or containing only aggregate data), noisy (containing errors, or outlier values which deviate from the expected), and inconsistent (e.g., containing discrepancies in the department codes used to categorize items).
Incomplete data can occur for a number of reasons. Attributes of interest may not always be available, such as customer information for sales transaction data. Other data may not be included simply because it was not considered important at the time of entry. Data that were inconsistent with other recorded data may have been deleted.
Data can be noisy, having incorrect attribute values, owing to the following. The data collection instruments used may be faulty. There may have been human or computer errors occurring at data entry. Incorrect data may also result from inconsistencies in naming conventions used. Duplicate tuples also require data cleaning.
Data cleaning routines work to "clean" the data by filling in missing values, smoothing noisy data, identifying or removing outliers, and resolving inconsistencies. Dirty data can cause confusion for the mining procedure.
Figure: Forms of data preprocessing
DATA CLEANING:
Real-world data tend to be incomplete, noisy, and inconsistent. Data cleaning routines attempt to fill in missing values, smooth out noise while identifying outliers, and correct inconsistencies in the data.
1. Missing values
Imagine that you need to analyze AllElectronics sales and customer data. You note that many tuples have no recorded value for several attributes, such as customer income.
i. Ignore the tuple: This is usually done when the class label is missing (assuming the mining task involves classification or description). This method is not very effective, unless the tuple contains several attributes with missing values.
ii. Fill in the missing value manually: In general, this approach is time-consuming and may not be feasible given a large data set with many missing values.
iii. Use a global constant to fill in the missing value: Replace all missing attribute values by the same constant, such as a label like "Unknown" or −∞ (minus infinity).
iv. Use the attribute mean to fill in the missing value: For example, suppose that the average income of AllElectronics customers is $28,000. Use this value to replace the missing value for income.
v. Use the attribute mean for all samples belonging to the same class as the given tuple: For example, if classifying customers according to credit risk, replace the missing value with the average income value for customers in the same credit risk category as that of the given tuple.
vi. Use the most probable value to fill in the missing value: This may be determined with inference-based tools using a Bayesian formalism or decision tree induction. For example, using the other customer attributes in your data set, you may construct a decision tree to predict the missing values for income.
2. Noisy data:
Noise is a random error or variance in a measured variable. Given a numeric attribute such as, say, price, how can we "smooth" out the data to remove the noise? Let's look at the following data smoothing techniques.
i. Binning methods: Binning methods smooth a sorted data value by consulting the "neighborhood", or values around it. The sorted values are distributed into a number of 'buckets', or bins.
Fig: Binning methods for data smoothing.
The figure illustrates some binning techniques. In this example, the data for price are first sorted and partitioned into equi-depth bins (of depth 3). In smoothing by bin means, each value in a bin is replaced by the mean value of the bin. For example, the mean of the values 4, 8, and 15 in Bin 1 is 9. Therefore, each original value in this bin is replaced by the value 9. Similarly, smoothing by bin medians can be employed, in which each bin value is replaced by the bin median. In smoothing by bin boundaries, the minimum and maximum values in a given bin are identified as the bin boundaries.
ii. Clustering: Outliers may be detected by clustering, where similar values are organized into groups or "clusters". Intuitively, values which fall outside of the set of clusters may be considered outliers.
Fig: Outliers may be detected by clustering analysis.
iii. Combined computer and human inspection: Outliers may be identified through a combination of computer and human inspection. In one application, for example, an information-theoretic measure was used to help identify outlier patterns in a handwritten character database for classification. The measure's value reflected the "surprise" content of the predicted character label with respect to the known label.
iv. Regression: Data can be smoothed by fitting the data to a function, such as with regression. Linear regression involves finding the "best" line to fit two variables, so that one variable can be used to predict the other. Multiple linear regression is an extension of linear regression, where more than two variables are involved. For example, if the fitted line is y = x + 1, then an input of x = 0 yields the prediction y = 0 + 1 = 1.
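Going back to the binning example above, the short sketch below implements smoothing by bin means for equi-depth bins of depth 3, using the same price values as the figure (4, 8, 15, ...). The helper is not part of the notes; it is a minimal illustration.

```python
# Smoothing by bin means: partition sorted values into equi-depth bins
# and replace every value in a bin by the bin mean.
prices = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])   # values from the binning figure
depth = 3

bins = [prices[i:i + depth] for i in range(0, len(prices), depth)]
smoothed = [[round(sum(b) / len(b), 2)] * len(b) for b in bins]

for original, means in zip(bins, smoothed):
    print(original, "->", means)
# Bin 1: [4, 8, 15] -> [9.0, 9.0, 9.0], matching the example in the text.
```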
DATA CLEANING AS A PROCESS:
Data cleaning is a process used to "clean" dirty or noisy data: it removes outliers and resolves any inconsistencies present in the data. The first step in data cleaning as a process is discrepancy detection. Discrepancies may be caused by several factors, including poorly designed data entry forms that have many optional fields, human error during data entry, deliberate errors, and data decay. Discrepancies may also arise from system errors, errors introduced during processing, and inconsistent data representations.
- Data discrepancy detection
  - Use metadata (e.g., domain, range, dependency, distribution)
  - Check field overloading
  - Check the uniqueness rule, consecutive rule, and null rule
  - Use commercial tools
Metadata, which is data about data, can be used to avoid discrepancies and to screen out unnecessary data from the data sets. Field overloading is another source of error; it occurs when new attribute definitions are squeezed into unused portions of already defined attributes. A uniqueness rule says that each value of the given attribute must be different from all other values for that attribute. A consecutive rule says that there can be no missing values between the lowest and highest values for the attribute, and that all values must also be unique. A null rule specifies the use of blanks, question marks, special characters, or other strings that may indicate the null condition.
Commercial tools:
- Data scrubbing tools: use simple domain knowledge (e.g., postal code, spell-check) to detect errors and make corrections
- Data auditing tools: analyze data to discover rules and relationships and to detect violators (e.g., correlation and clustering to find outliers)
- Data migration tools: allow transformations to be specified
- ETL (Extraction/Transformation/Loading) tools: allow users to specify transformations through a graphical user interface.

DATA INTEGRATION
It is likely that your data analysis task will involve data integration, which combines data from multiple sources into a coherent data store, as in data warehousing. These sources may include multiple databases, data cubes, or flat files. There are a number of issues to consider during data integration.
The methods/procedures for smooth data integration are:
- Metadata
- Correlation analysis
- Data conflict detection
Schema integration can be an issue: when real-world entities from multiple data sources are combined, the question arises of how to match them up. This is referred to as the entity identification problem. For example, how can the data analyst or the computer be sure that customer_id in one database and cust_number in another refer to the same entity? Databases and data warehouses typically have metadata - that is, data about the data. Such metadata can be used to help avoid errors in schema integration.
Redundancy is another important issue. An attribute may be redundant if it can be "derived" from another table, such as annual revenue. Inconsistencies in attribute or dimension naming can also cause redundancies in the resulting data set.
Some redundancies can be detected by correlation analysis. For example, given two attributes, such analysis can measure how strongly one attribute implies the other, based on the available data. The correlation between attributes A and B can be measured by
    P(A ∧ B) / (P(A) P(B))
If the resulting value of this expression is greater than 1, then A and B are positively correlated. The higher the value, the more each attribute implies the other. Hence, a high value may indicate that A (or B) may be removed as a redundancy. If the resulting value is equal to 1, then A and B are independent and there is no correlation between them. If the resulting value is less than 1, then A and B are negatively correlated. This means that each attribute discourages the other.
- Correlation analysis is a statistical technique used to describe not only the degree of relationship between the variables but also the direction of co-variation. It deals with the association between two or more variables.
- For numerical attributes, the correlation between two attributes is computed by the correlation coefficient.
Correlation Analysis (Numerical Data): the correlation coefficient (also called Pearson's product moment coefficient) is
    r_{A,B} = Σ (a_i − Ā)(b_i − B̄) / ((n − 1) σ_A σ_B) = (Σ (a_i b_i) − n Ā B̄) / ((n − 1) σ_A σ_B)
where n is the number of tuples, Ā and B̄ are the respective means of A and B, σ_A and σ_B are the respective standard deviations of A and B, and Σ(a_i b_i) is the sum of the AB cross-product.
- If r_{A,B} > 0, A and B are positively correlated (A's values increase as B's do); the higher the value, the stronger the correlation.
- r_{A,B} = 0: independent; r_{A,B} < 0: negatively correlated.
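To make the correlation coefficient concrete, the sketch below computes r_{A,B} for two made-up numeric attributes using the formula above; NumPy's corrcoef is used only as a cross-check. The attribute values are invented for illustration.

```python
# Pearson correlation coefficient r_{A,B}, following the formula above.
import numpy as np

A = np.array([2.0, 4.0, 6.0, 8.0, 10.0])    # illustrative attribute values
B = np.array([1.0, 3.0, 5.0, 9.0, 11.0])

n = len(A)
r = np.sum((A - A.mean()) * (B - B.mean())) / ((n - 1) * A.std(ddof=1) * B.std(ddof=1))

print("r_{A,B} =", round(r, 4))
print("cross-check:", round(np.corrcoef(A, B)[0, 1], 4))
# r > 0 here, so A and B are positively correlated; one of them could be
# considered for removal as a redundant attribute.
```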
Correlation Analysis (Categorical Data):
- The χ² (chi-square) test is used.
- The larger the χ² value, the more likely the variables are related.
- The cells that contribute the most to the χ² value are those whose actual count is very different from the expected count.
- Correlation does not imply causality:
  - The number of hospitals and the number of car thefts in a city are correlated.
  - Both are causally linked to a third variable: population.
- The expected frequency can be computed as
    e_ij = (count(A = a_i) × count(B = b_j)) / N
  where N is the number of tuples, count(A = a_i) is the number of tuples having value a_i for A, and count(B = b_j) is the number of tuples having value b_j for B.
- The χ² statistic tests the hypothesis that A and B are independent. The test is based on a significance level, with (r − 1) × (c − 1) degrees of freedom. If the hypothesis can be rejected, then A and B are statistically related or associated.
Chi-Square Calculation: An Example
                              Male       Female      Sum (row)
  Like science fiction        250 (90)   200 (360)   450
  Not like science fiction    50 (210)   1000 (840)  1050
  Sum (col.)                  300        1200        1500
Using the formula above, the expected frequencies are:
  e11 = (300 × 450) / 1500 = 90      (male, likes science fiction)
  e12 = (300 × 1050) / 1500 = 210    (male, does not like science fiction)
  e21 = (1200 × 450) / 1500 = 360    (female, likes science fiction)
  e22 = (1200 × 1050) / 1500 = 840   (female, does not like science fiction)
The χ² (chi-square) value is then calculated (the numbers in parentheses in the table are the expected counts, computed from the data distribution in the two categories):
    χ² = Σ (Observed − Expected)² / Expected
    χ² = (250 − 90)²/90 + (50 − 210)²/210 + (200 − 360)²/360 + (1000 − 840)²/840 = 507.93
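A quick sketch reproducing the χ² calculation above from the contingency table; it uses only NumPy, with no statistics library assumed.

```python
# Chi-square test of independence for the 2x2 contingency table above.
import numpy as np

observed = np.array([[250, 200],     # like science fiction:     male, female
                     [50, 1000]])    # not like science fiction: male, female

row_sums = observed.sum(axis=1, keepdims=True)
col_sums = observed.sum(axis=0, keepdims=True)
N = observed.sum()

expected = row_sums @ col_sums / N            # e_ij = count(A=a_i) * count(B=b_j) / N
chi_square = ((observed - expected) ** 2 / expected).sum()

print(expected)       # [[ 90. 360.] [210. 840.]]
print(chi_square)     # about 507.9, far above the 10.828 threshold, so reject independence
```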
- For this 2 × 2 contingency table, the degrees of freedom are (2 − 1)(2 − 1) = 1. For 1 degree of freedom, the χ² value needed to reject the hypothesis at the 0.001 significance level is 10.828.
- Since 507.93 > 10.828, the hypothesis that gender and preferred reading are independent is rejected. The two attributes are correlated.
- This shows that liking science fiction and gender are correlated in this group.

DATA TRANSFORMATION:
In data transformation, the data are transformed or consolidated into forms appropriate for mining. Data transformation can involve the following:
1. Smoothing, which works to remove the noise from data. Such techniques include binning, clustering, and regression.
- Regression analysis estimates or predicts the unknown value of one variable from the known value of another variable.
- Clustering is a mechanism to detect outliers.
- Binning smooths a sorted data value by consulting its "neighborhood": the sorted values are distributed into a number of "buckets" or "bins". In smoothing by bin means, the values in a bin are replaced by the bin mean. (Note: explained in detail under the Data Cleaning topic above.)
2. Aggregation, where summary or aggregation operations are applied to the data. For example, the daily sales data may be aggregated so as to compute monthly and annual total amounts. This step is typically used in constructing a data cube for analysis of the data at multiple granularities. For example, suppose an organization wants to compute the average annual income of its employees:
  emp_no   Annual_income
  100      25,00,000
  101      30,00,000
  103      5,00,000
Using avg() in the database, this can be computed easily.
3. Generalization of the data, where low level or 'primitive' (raw) data are replaced by higher level concepts through the use of concept hierarchies. For example, categorical attributes, like street, can be generalized to higher level concepts, like city or county.
Similarly, values for numeric attributes, like age, may be mapped to higher level concepts, like young, middle-aged, and senior.
4. Normalization, where the attribute data are scaled so as to fall within a small specified range, such as -1.0 to 1.0, or 0 to 1.0. This keeps attribute values small and comparable. For instance, consider an employee table in which the annual salary values are very large; using normalization, these values can instead be expressed on a smaller scale, such as annual salary (in lakhs).
- There are three types of normalization methods:
  i. min-max normalization
  ii. z-score normalization
  iii. normalization by decimal scaling
Min-max normalization (to [new_min_A, new_max_A]): This method performs a linear transformation on the original data. Suppose that min_A and max_A are the minimum and maximum values of an attribute A. Min-max normalization maps a value v of A to v' in the range [new_min_A, new_max_A] by computing
    v' = ((v − min_A) / (max_A − min_A)) × (new_max_A − new_min_A) + new_min_A
Min-max normalization preserves the relationships among the original data values. It will encounter an "out of bounds" error if a future input case for normalization falls outside of the original data range for A.
Ex: Suppose that the minimum and maximum values for the attribute income are 12,000 and 98,000, and that income is to be normalized to the range [0.0, 1.0]. Then 73,600 is mapped to
    (73,600 − 12,000) / (98,000 − 12,000) × (1.0 − 0) + 0 = 0.716
Z-score normalization (or zero-mean normalization): the values for an attribute A are normalized based on the mean and standard deviation of A. A value v of A is normalized to v' by computing
    v' = (v − μ_A) / σ_A
where μ_A and σ_A are the mean and standard deviation of A. This method is useful when the actual minimum and maximum of attribute A are unknown, or when there are outliers which dominate min-max normalization.
Let μ_A = 54,000 and σ_A = 16,000. Then 73,600 is normalized to
    (73,600 − 54,000) / 16,000 = 1.225
Normalization by decimal scaling: normalizes by moving the decimal point of the values of A. A value v of A is normalized to v' by computing
    v' = v / 10^j
where j is the smallest integer such that max(|v'|) < 1.
Ex: Suppose the recorded values of A range from -986 to 917, so the maximum absolute value of A is 986. To normalize by decimal scaling we divide each value by 1,000 (i.e., j = 3), so -986 normalizes to -0.986 and 917 normalizes to 0.917.
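A small sketch of the three normalization methods using the income figures from the examples above; the helper function names are my own and are not part of the notes.

```python
# Min-max, z-score and decimal-scaling normalization (illustrative helpers).
import math

def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mean_a, std_a):
    return (v - mean_a) / std_a

def decimal_scaling(values):
    # j chosen so that max(|v|) / 10**j < 1
    j = math.ceil(math.log10(max(abs(v) for v in values) + 1))
    return [v / 10 ** j for v in values], j

print(round(min_max(73_600, 12_000, 98_000), 3))   # 0.716
print(round(z_score(73_600, 54_000, 16_000), 3))   # 1.225
print(decimal_scaling([-986, 917]))                # ([-0.986, 0.917], 3)
```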
DATA REDUCTION
Why data reduction?
- A database/data warehouse may store terabytes of data.
- Complex data analysis/mining may take a very long time to run on the complete data set.
Data reduction
- Data reduction is a process that removes irrelevant data from the data sets, producing a representation that is much smaller in volume yet does not violate the integrity of the original data.
Data reduction strategies
- Data cube aggregation: aggregation operations are applied to the data in the construction of a data cube.
- Attribute subset selection - e.g., remove unimportant attributes.
- Dimensionality reduction (data compression) - where irrelevant, weakly relevant, or redundant attributes or dimensions may be detected and removed, e.g., using encoding schemes.
- Numerosity reduction: where the data are replaced or estimated by alternative, smaller data representations such as parametric models (which need store only the model parameters instead of the actual data), or nonparametric methods such as clustering, sampling, and the use of histograms - e.g., fit data into models.
- Discretization and concept hierarchy generation, where raw data values for attributes are replaced by ranges or higher conceptual levels. Concept hierarchies allow the mining of data at multiple levels of abstraction, and are a powerful tool for data mining. This is also a form of data reduction that can automatically generate hierarchies for numerical data.
1. Data Cube Aggregation:
- In this strategy, data cubes are constructed by applying aggregation operations to the data.
- E.g., in an organization the sales have been recorded per quarter for the years 1997-1999, but the interest is in the annual sales. The data can therefore be aggregated so that the resulting data describe the total sales per year rather than per quarter.
- The tables below show the sales recorded quarterly; these are aggregated to determine the yearly totals.
- Data cubes store multidimensional aggregated information. Ex: a data cube can store a multidimensional analysis of sales data.
2. Attribute Subset Selection:
- Feature selection (i.e., attribute subset selection):
  - Attribute subset selection reduces the data size by removing irrelevant or redundant attributes.
  - The main objective of this method is to find a minimum set of attributes such that analysis of the reduced data yields results as close as possible to those obtained from the original set of attributes.
  - Data sets for analysis may encompass many attributes, some of which may be irrelevant or redundant to the mining task.
- Heuristic methods (due to the exponential number of possible attribute subsets):
  - Step-wise forward selection
  - Step-wise backward elimination
  - Combining forward selection and backward elimination
  - Decision-tree induction
Step-wise forward selection: The procedure starts with an empty set of attributes. The best of the original attributes is determined and added to the set. At each subsequent iteration or step, the best of the remaining original attributes is added to the set.
Step-wise backward elimination: The procedure starts with the full set of attributes. At each step, it removes the worst attribute remaining in the set.
Combined forward selection and backward elimination: The step-wise forward selection and backward elimination methods can be combined, so that at each step one selects the best attribute and removes the worst from among the remaining attributes.
Decision tree induction: Decision tree induction constructs a flow-chart-like structure where each internal (non-leaf) node denotes a test on an attribute, each branch corresponds to an outcome of the test, and each external (leaf) node denotes a class prediction.
3. Data Compression:
In data compression, data encoding or transformations are applied so as to obtain a reduced or "compressed" representation of the original data. If the original data can be reconstructed from the compressed data without any loss of information, the data compression technique used is called lossless. If, instead, we can reconstruct only an approximation of the original data, then the data compression technique is called lossy.
There are several well-tuned algorithms for string compression. Although they are typically lossless, they allow only limited manipulation of the data. In this section, we instead focus on two popular and effective methods of lossy data compression: wavelet transforms and principal components analysis.
Wavelet transforms: The discrete wavelet transform (DWT) is a linear signal processing technique that, when applied to a data vector D, transforms it to a numerically different vector, D', of wavelet coefficients. The two vectors are of the same length. The DWT is closely related to the discrete Fourier transform (DFT), a signal processing technique involving sines and cosines. (DWT achieves lossy compression.)
The general algorithm for a discrete wavelet transform is as follows:
1) The length, L, must be an integer power of 2 (padding with 0's when necessary).
2) Each transform has 2 functions: the first applies some data smoothing; the second performs a weighted difference.
3) The two functions are applied to pairs of the input data, resulting in two sets of data of length L/2.
4) The two functions are recursively applied to the sets of data obtained in the previous loop.
5) A selection of values from the data sets obtained in the above iterations are designated the wavelet coefficients of the transformed data.
Figure: DWT for image compression - the image is passed repeatedly through low-pass and high-pass filters.
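The sketch below is a bare-bones, simplified Haar-style transform that follows the pairwise smoothing/weighted-difference steps described in the algorithm above; real DWT filter coefficients differ, so treat this only as an illustration of the recursive structure.

```python
# Simplified Haar-style wavelet transform: pairwise averages (smoothing)
# and pairwise differences, applied recursively to the smoothed half.
def haar_transform(data):
    assert len(data) & (len(data) - 1) == 0, "length must be a power of 2"
    coeffs = []
    current = list(data)
    while len(current) > 1:
        averages = [(current[i] + current[i + 1]) / 2 for i in range(0, len(current), 2)]
        details  = [(current[i] - current[i + 1]) / 2 for i in range(0, len(current), 2)]
        coeffs = details + coeffs      # keep the detail (difference) coefficients
        current = averages             # recurse on the smoothed signal
    return current + coeffs            # overall average followed by details

signal = [2, 2, 0, 2, 3, 5, 4, 4]      # length 8 = 2**3, as required by step 1
print(haar_transform(signal))
```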
Dimensionality Reduction: Principal Component Analysis (PCA)
Principal components analysis can be used as a method of data compression. The basic procedure is as follows:
- Normalize the input data, so that each attribute falls within the same range.
- Compute k orthonormal (unit) vectors, i.e., the principal components.
- Each input data item (vector) is a linear combination of the k principal component vectors.
- The principal components are sorted in order of decreasing "significance" or strength.
- Since the components are sorted, the size of the data can be reduced by eliminating the weak components, i.e., those with low variance. (Using the strongest principal components, it is possible to reconstruct a good approximation of the original data.)
- Works for numeric data only.
- Used when the number of dimensions is large.
4. Numerosity Reduction:
Can we reduce the data volume by choosing alternative, 'smaller' forms of data representation? Techniques of numerosity reduction can indeed be applied for this purpose. These techniques may be parametric or non-parametric. For parametric methods, a model is used to estimate the data, so that typically only the data parameters need be stored, instead of the actual data. (Outliers may also be stored.) Log-linear models, which estimate discrete multidimensional probability distributions, are an example. Non-parametric methods for storing reduced representations of the data include histograms, clustering, and sampling.
Regression and log-linear models can be used to approximate the given data. In linear regression, the data are modeled to fit a straight line. For example, a random variable, Y (called a response variable), can be modeled as a linear function of another random variable, X (called a predictor variable), with the equation
    Y = α + βX,
where the variance of Y is assumed to be constant. The coefficients α and β (called regression coefficients) specify the Y-intercept and slope of the line, respectively. These coefficients can be solved for by the method of least squares, which minimizes the error between the actual line separating the data and the estimate of the line. Multiple regression is an extension of linear regression allowing a response variable Y to be modeled as a linear function of a multidimensional feature vector.
Log-linear models approximate discrete multidimensional probability distributions. The method can be used to estimate the probability of each cell in a base cuboid for a set of discretized attributes, based on the smaller cuboids making up the data cube lattice. This allows higher order data cubes to be constructed from lower order ones. Log-linear models are therefore also useful for data compression (since the smaller order cuboids together typically occupy less space than the base cuboid) and data smoothing (since cell estimates in the smaller order cuboids are less subject to sampling variations than cell estimates in the base cuboid).
Histograms:
Histograms use binning to approximate data distributions and are a popular form of data reduction. A histogram for an attribute A partitions the data distribution of A into disjoint subsets, or buckets. The buckets are displayed on a horizontal axis, while the height (and area) of a bucket typically reflects the average frequency of the values represented by the bucket. If each bucket represents only a single attribute-value/frequency pair, the buckets are called singleton buckets. Often, buckets instead represent continuous ranges for the given attribute.
- Divide the data into buckets and store the average (or sum) for each bucket.
- Partitioning rules:
  - Equal-width: equal bucket range.
  - Equal-frequency (or equal-depth): each bucket holds roughly the same number of values.
  - V-optimal: the histogram with the least variance (a weighted sum over the original values that each bucket represents).
  - MaxDiff: set bucket boundaries between each pair of adjacent values whose differences are among the β−1 largest differences.
Figure: A histogram for price where values are aggregated so that each bucket has a uniform width of $10.
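To illustrate the equal-width bucketing shown in the price histogram, the sketch below groups made-up price values into $10-wide buckets and stores a count and an average per bucket; the prices are invented for the example.

```python
# Equal-width histogram: $10-wide buckets, storing count and average per bucket.
from collections import defaultdict

prices = [5, 8, 12, 14, 14, 15, 18, 21, 22, 25, 25, 25, 28, 31]   # illustrative values
width = 10

buckets = defaultdict(list)
for p in prices:
    low = (p // width) * width            # bucket lower boundary
    buckets[low].append(p)

for low in sorted(buckets):
    values = buckets[low]
    print(f"[{low}-{low + width}): count={len(values)}, avg={sum(values) / len(values):.1f}")
```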
Clustering:
Clustering techniques consider data tuples as objects. They partition the objects into groups or clusters, so that objects within a cluster are "similar" to one another and "dissimilar" to objects in other clusters. Similarity is commonly defined in terms of how "close" the objects are in space, based on a distance function. The "quality" of a cluster may be represented by its diameter, the maximum distance between any two objects in the cluster. Centroid distance is an alternative measure of cluster quality, and is defined as the average distance of each cluster object from the cluster centroid (denoting the "average object", or average point in space for the cluster).
Sampling:
Sampling can be used as a data reduction technique since it allows a large data set to be represented by a much smaller random sample (or subset) of the data. Suppose that a large data set, D, contains N tuples. Let's have a look at some possible samples for D.
Sampling: with or without replacement
1. Simple random sample without replacement (SRSWOR) of size n: This is created by drawing n of the N tuples from D (n < N), where the probability of drawing any tuple in D is 1/N, i.e., all tuples are equally likely.
2. Simple random sample with replacement (SRSWR) of size n: This is similar to SRSWOR, except that each time a tuple is drawn from D, it is recorded and then replaced. That is, after a tuple is drawn, it is placed back in D so that it may be drawn again.
3. Cluster sample: If the tuples in D are grouped into M mutually disjoint "clusters", then an SRS of m clusters can be obtained, where m < M. For example, tuples in a database are usually retrieved a page at a time, so that each page can be considered a cluster. A reduced data representation can be obtained by applying, say, SRSWOR to the pages, resulting in a cluster sample of the tuples.
4. Stratified sample: If D is divided into mutually disjoint parts called "strata", a stratified sample of D is generated by obtaining an SRS at each stratum. This helps to ensure a representative sample, especially when the data are skewed. For example, a stratified sample may be obtained from customer data, where a stratum is created for each
customer age group. In this way, the age group having the smallest number of customers will be sure to be represented.
5. Discretization and concept hierarchy generation:
It is difficult and tedious to specify concept hierarchies for numeric attributes due to the wide diversity of possible data ranges and the frequent updates of data values. Concept hierarchies for numeric attributes can be constructed automatically based on data distribution analysis.
- Three types of attributes:
  - Nominal - values from an unordered set, e.g., color, profession
  - Ordinal - values from an ordered set, e.g., military or academic rank
  - Continuous - real numbers, e.g., integer or real numbers
- Discretization:
  - Divide the range of a continuous attribute into intervals
  - Some classification algorithms only accept categorical attributes
  - Reduce data size by discretization
  - Prepare for further analysis
We examine five methods for numeric concept hierarchy generation: binning, histogram analysis, clustering analysis, entropy-based discretization, and segmentation by natural partitioning.
(i) Binning. These methods are also forms of discretization. For example, attribute values can be discretized by replacing each bin value by the bin mean or median, as in smoothing by bin means or smoothing by bin medians, respectively. These techniques can be applied recursively to the resulting partitions in order to generate concept hierarchies.
(ii) Histogram analysis. Histograms (as already discussed) can also be used for discretization. Figure 3.13 presents a histogram showing the data distribution of the attribute price for a given data set. For example, the most frequent price range is roughly $300-$325. Partitioning rules can be used to define the ranges of values. For instance, in an equi-width histogram, the values are partitioned into equal sized partitions or ranges (e.g., ($0-$100], ($100-$200], ..., ($900-$1,000]). With an equi-depth histogram, the values are partitioned so that, ideally, each partition contains the same number of data samples. The histogram analysis algorithm can be applied recursively to each partition in order to automatically generate a multilevel concept hierarchy, with the procedure terminating once a pre-specified number of concept levels has been reached. A minimum interval size can also be used per level to control the recursive procedure. This specifies the minimum width of a partition, or the minimum number of values for each partition at each level.
(iii) Clustering analysis. A clustering algorithm can be applied to partition data into clusters or groups. Each cluster forms a node of a concept hierarchy, where all nodes are at the same conceptual level. Each cluster may be further decomposed into several subclusters, forming a lower level of the hierarchy. Clusters may also be grouped together in order to form a higher conceptual level of the hierarchy.
(iv) Entropy-based discretization. An information-based measure called "entropy" can be used to recursively partition the values of a numeric attribute A, resulting in a hierarchical discretization. Such a discretization forms a numerical concept hierarchy for the attribute. Given a set of data tuples, S, the basic method for entropy-based discretization of A is as follows.
- Each value of A can be considered a potential interval boundary or threshold T. For example, a value v of A can partition the samples in S into two subsets satisfying the conditions A < v and A >= v, respectively, thereby creating a binary discretization.
- Given S, the threshold value selected is the one that maximizes the information gain resulting from the subsequent partitioning.
The information gain is determined by the expected information requirement after partitioning at threshold T:
    I(S, T) = (|S1| / |S|) Entropy(S1) + (|S2| / |S|) Entropy(S2)
- Entropy is calculated based on the class distribution of the samples in the set. Given m classes, the entropy of S1 is
    Entropy(S1) = − Σ_{i=1..m} p_i log2(p_i)
  where p_i is the probability of class i in S1.
- The boundary that minimizes the entropy function over all possible boundaries (equivalently, maximizes the information gain) is selected as the binary discretization.
- The process is recursively applied to the partitions obtained until some stopping criterion is met. Such a boundary may reduce data size and improve classification accuracy.
(v) Segmentation by natural partitioning. Although binning, histogram analysis, clustering and entropy-based discretization are useful in the generation of numerical hierarchies, many users would like to see numerical ranges partitioned into relatively uniform, easy-to-read intervals that appear intuitive or "natural". For example, annual salaries broken into ranges like [$50,000, $60,000) are often more desirable than ranges like [$51,263.98, $60,872.34) obtained by some sophisticated clustering analysis.
- A simple 3-4-5 rule can be used to segment numeric data into relatively uniform, "natural" intervals:
  - If an interval covers 3, 6, 7 or 9 distinct values at the most significant digit, partition the range into 3 equi-width intervals.
  - If it covers 2, 4, or 8 distinct values at the most significant digit, partition the range into 4 intervals.
  - If it covers 1, 5, or 10 distinct values at the most significant digit, partition the range into 5 intervals.
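Going back to the entropy-based discretization in (iv), here is a small sketch that scans candidate thresholds for a numeric attribute and picks the one minimizing the weighted entropy I(S, T), i.e., maximizing the information gain. The sample values and class labels are invented for illustration.

```python
# Entropy-based binary discretization: choose the threshold T that minimizes
# I(S, T) = |S1|/|S| * Entropy(S1) + |S2|/|S| * Entropy(S2).
from math import log2
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def best_split(values, labels):
    best_t, best_i = None, float("inf")
    for t in sorted(set(values)):
        left  = [l for v, l in zip(values, labels) if v < t]
        right = [l for v, l in zip(values, labels) if v >= t]
        if not left or not right:
            continue
        i_st = len(left) / len(values) * entropy(left) + len(right) / len(values) * entropy(right)
        if i_st < best_i:
            best_t, best_i = t, i_st
    return best_t, best_i

ages   = [23, 25, 30, 35, 40, 45, 52, 60]          # illustrative attribute A
labels = ["B", "B", "B", "B", "C", "C", "C", "C"]  # class labels of the samples
print(best_split(ages, labels))                     # (40, 0.0) - a clean boundary at 40
```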
The results of applying the 3-4-5 rule are shown in Figure 3.14.
- Step 1: Based on the above information, the minimum and maximum values are MIN = -$351,976.00 and MAX = $4,700,896.50. The low (5th percentile) and high (95th percentile) values to be considered for the top or first level of segmentation are LOW = -$159,876 and HIGH = $1,838,761.
- Step 2: Given LOW and HIGH, the most significant digit is at the million dollar digit position (i.e., msd = 1,000,000). Rounding LOW down to the million dollar digit, we get LOW' = -$1,000,000; rounding HIGH up to the million dollar digit, we get HIGH' = +$2,000,000.
- Step 3: Since this interval ranges over 3 distinct values at the most significant digit, i.e., ($2,000,000 - (-$1,000,000)) / $1,000,000 = 3, the segment is partitioned into 3 equi-width subsegments according to the 3-4-5 rule: (-$1,000,000, $0], ($0, $1,000,000], and ($1,000,000, $2,000,000]. This represents the top tier of the hierarchy.
- Step 4: We now examine the MIN and MAX values to see how they "fit" into the first-level partitions.
- Step 5: Recursively, each interval can be further partitioned according to the 3-4-5 rule to form the next lower level of the hierarchy.
Concept Hierarchy Generation for Categorical Data:
- Specification of a partial/total ordering of attributes explicitly at the schema level by users or experts, e.g., street < city < state < country
- Specification of a hierarchy for a set of values by explicit data grouping, e.g., {Urbana, Champaign, Chicago} < Illinois
- Specification of only a partial set of attributes, e.g., only street < city, and not the others
- Automatic generation of hierarchies (or attribute levels) by analyzing the number of distinct values, e.g., for a set of attributes {street, city, state, country}
Automatic Concept Hierarchy Generation:
- Some hierarchies can be automatically generated based on the analysis of the number of distinct values per attribute in the data set.
- The attribute with the most distinct values is placed at the lowest level of the hierarchy.
- There are exceptions, e.g., weekday, month, quarter, year.
Summary:
- Data preparation is an important issue for both data warehousing and data mining, as real-world data tends to be incomplete, noisy, and inconsistent. Data preparation includes data cleaning, data integration, data transformation, and data reduction.
- Data cleaning routines can be used to fill in missing values, smooth noisy data, identify outliers, and correct data inconsistencies.
- Data integration combines data from multiple sources to form a coherent data store. Metadata, correlation analysis, data conflict detection, and the resolution of semantic heterogeneity contribute towards smooth data integration.
- Data transformation routines convert the data into forms appropriate for mining. For example, attribute data may be normalized so as to fall within a small range, such as 0 to 1.0.
- Data reduction techniques such as data cube aggregation, dimension reduction, data compression, numerosity reduction, and discretization can be used to obtain a reduced representation of the data, while minimizing the loss of information content.
- Concept hierarchies organize the values of attributes or dimensions into gradual levels of abstraction. They are a form of discretization that is particularly useful in multilevel mining.
- Automatic generation of concept hierarchies for categorical data may be based on the number of distinct values of the attributes defining the hierarchy. For numeric data, techniques such as data segmentation by partition rules, histogram analysis, and clustering analysis can be used.
- Although several methods of data preparation have been developed, data preparation remains an active area of research.