2. Extraction or ‘mining’ of large amounts of data
Also known as knowledge mining from data / knowledge extraction / data or
pattern analysis / data archaeology / data dredging
Most popular term – Knowledge Discovery from Data (KDD)
Data is available in huge amounts -> imminent need to turn it into useful information
Applications – market analysis, fraud detection, customer retention, production
control, science exploration
3. Data cleaning (remove noise and inconsistent data)
Data integration (combine multiple data sources)
Data selection (relevant data is retrieved from database)
Data transformation (data is transformed or consolidated into forms appropriate
for mining, e.g., through summary or aggregation operations)
Data mining (extraction of data patterns)
Pattern evaluation (identifying interesting patterns representing knowledge using
interestingness measures)
Knowledge presentation (visualization and presentation of mined knowledge)
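A minimal sketch of how these steps might be chained in Python with pandas; the function name, file sources, and column names are illustrative assumptions, not a fixed API.

```python
import pandas as pd

def kdd_pipeline(sources):
    # Data integration: combine multiple data sources into one frame
    data = pd.concat([pd.read_csv(s) for s in sources], ignore_index=True)
    # Data cleaning: remove duplicates, fill missing numeric values
    data = data.drop_duplicates()
    data = data.fillna(data.mean(numeric_only=True))
    # Data selection: keep only attributes relevant to the task (hypothetical columns)
    relevant = data[["age", "income", "purchases"]]
    # Data transformation: consolidate via aggregation for mining
    summary = relevant.groupby("age").agg({"income": "mean", "purchases": "sum"})
    # Data mining, pattern evaluation, and presentation would follow here
    return summary
```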
5. Database, Data Warehouses, WWW, Information Repositories – these may be a set of
databases/warehouses or any other information repositories. Data cleaning and
data integration are performed on them.
Database / Data Warehouse servers – responsible for fetching relevant data based
on user’s request
Knowledge base – the domain knowledge that guides the search. Includes
concept hierarchies used to organize attributes, as well as user beliefs
Data mining engine – consists of functional modules for tasks such as
characterization, association and correlation analysis, classification, prediction,
cluster analysis, and outlier analysis.
Pattern evaluation module – employs interestingness measures and interacts
with the data mining modules to focus the search towards interesting patterns
User interface – the user specifies a data mining query or task, provides information
to help focus the search, and performs exploratory data mining based on intermediate
data mining results.
6. Relational Databases
Data Warehouses
Transactional Databases
Advanced Data and Information Systems and Advanced Applications
Object-Relational Database
Temporal Database/Sequence Database and Time-Series Database
Spatial Databases and Spatiotemporal Databases
Text Databases and Multimedia Databases
Heterogeneous Databases and Legacy Databases
Data Streams
World Wide Web
7. No coupling
The DM system does not utilize any function of the DB/DW system.
It fetches data from a source and stores the results in a separate file
Drawbacks
Without a DB system, the DM system spends time searching for, collecting, and transforming data.
The DM system has no tested, scalable algorithms or data structures implemented
The DM system needs another tool to extract data
Loose coupling
The DM system uses some features of the DB system: fetching data from the database, performing
data mining, and storing the results in a file or in a designated place in the database
Advantage
Fetches data from the database using query processing and indexing
Gains the advantages of flexibility and efficiency provided by the DB system
Disadvantage – mining does not exploit the data structures or query optimization methods of the DB system
8. Semi-tight coupling
The DM system is linked to the DB system, and efficient implementations of a few essential
data mining primitives are provided by the DB system
These include sorting, indexing, aggregation, histogram analysis, and pre-computation of
statistical measures such as sum, count, min, max, and standard deviation
Enhances the performance of the DM system, since some frequently used results are pre-computed
Tight coupling
The DM system is smoothly integrated into the DB system.
Data mining queries and functionalities are optimized based on mining query analysis,
data structures, indexing schemes, and query processing methods.
9. Why preprocess the data?
Incomplete (lacking attribute values)
Noisy (containing errors or outliers)
Inconsistent (containing discrepancies, e.g., in the department codes used to categorize items)
Redundancy (repetition of the same data)
Descriptive data summarization helps in the study of the general characteristics of
the data and identifies the presence of noise or outliers, which is useful for
successful data cleaning and data integration.
Measures of central tendency – mean, median, weighted arithmetic mean, mode
Measures of data dispersion – quartiles, interquartile range, variance
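A quick illustration of these measures with Python's standard library; the sample values are made up.

```python
import statistics

data = [12, 15, 15, 18, 21, 24, 30, 45]        # made-up sample

mean   = statistics.mean(data)                 # central tendency
median = statistics.median(data)
mode   = statistics.mode(data)                 # most frequent value

q1, q2, q3 = statistics.quantiles(data, n=4)   # quartiles
iqr      = q3 - q1                             # interquartile range
variance = statistics.variance(data)           # sample variance

print(mean, median, mode, iqr, variance)
```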
10. A distributive measure is a measure that can be computed for a given data set by
partitioning the data into smaller subsets, computing the measure for each subset,
and then merging the results in order to arrive at the measure's value for the
original data set.
An algebraic measure is a measure that can be computed by applying an algebraic
function to one or more distributive measures.
A holistic measure is a measure that must be computed on the entire data set as a
whole. It cannot be computed by partitioning the given data into subsets and
merging the values obtained for the measure in each subset.
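For example, sum and count are distributive, and the mean is algebraic because it is a function of those two distributive measures; the median is holistic. A minimal sketch with made-up partitions:

```python
partitions = [[2, 4, 6], [1, 3], [5, 7, 9, 11]]    # data split into subsets

# Distributive: compute per partition, then merge the partial results
total = sum(sum(p) for p in partitions)            # merged sums
count = sum(len(p) for p in partitions)            # merged counts

# Algebraic: an algebraic function of distributive measures
mean = total / count

# Holistic (e.g., the median): no such shortcut, needs the whole data set
median = sorted(x for p in partitions for x in p)[count // 2]
```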
11. The degree to which the numerical data tend to spread is called dispersion or
variance of the data.
The most common measures of dispersion are the range, five-number summary,
interquartile range, and standard deviation.
Popular graphs for displaying data summaries and dispersion include
histograms, quantile plots, q-q plots, scatter plots, and loess curves.
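A minimal matplotlib sketch of two such displays, a histogram and a boxplot (the latter visualizes the five-number summary); the values are illustrative.

```python
import matplotlib.pyplot as plt

values = [5, 7, 8, 8, 9, 10, 12, 13, 13, 14, 18, 25]   # made-up data

fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.hist(values, bins=5)          # histogram: frequencies of value ranges
ax1.set_title("Histogram")
ax2.boxplot(values)               # box shows Q1, median, Q3; whiskers show range
ax2.set_title("Five-number summary")
plt.show()
```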
13. Data cleaning routines fill in missing values, smooth out noise, identify outliers,
and correct inconsistencies
Missing values
Ignore the tuple
Fill the missing value manually
Use a global constant to fill the missing value
Use the attribute mean to fill the missing value
Use the attribute mean for samples belonging to the same class as the given tuple
Use the most probable value to fill the missing value
Use regression, decision-tree induction, or Bayesian formalism
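A sketch of several of these strategies using pandas; the column names and values are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"income": [40, None, 55, None, 70],
                   "class":  ["A", "A", "B", "B", "B"]})

# Ignore the tuple: drop rows with missing values
dropped = df.dropna()

# Use a global constant to fill the missing value
constant = df.fillna({"income": -1})

# Use the attribute mean
by_mean = df.fillna({"income": df["income"].mean()})

# Use the attribute mean for samples belonging to the same class
by_class = df.copy()
by_class["income"] = df.groupby("class")["income"].transform(
    lambda s: s.fillna(s.mean()))
```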
14. Noisy data
Binning
Consults the neighboring values
Performs local smoothing
Smoothing by bin means – each value in a bin is replaced by the mean value of the bin (see
the sketch after this list)
Smoothing by bin medians – each value in a bin is replaced by the bin median
Smoothing by bin boundaries – the min and max values of a bin are the bin boundaries, and each
value in the bin is replaced by the closest boundary
Regression
Fits the data to a function
Linear regression finds the best line to fit two attributes
Multiple regression involves more than two variables
Clustering
Outliers are detected through clustering, where similar values are organized into clusters
Values falling outside the clusters are outliers
15. Data integration
The entity identification problem is the matching of equivalent real-world entities from multiple
data sources
Correlation analysis measures how strongly one attribute implies another
Data transformation
Smoothing – binning, regression, clustering
Aggregation
Generalization – low-level data are replaced by higher-level concepts through the use of
concept hierarchies
Normalization – data is scaled to fall within a small specified range (sketched in code after
this list)
Min-max normalization
Z-score normalization
Decimal scaling
Attribute construction
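Minimal sketches of the three normalization methods; the example values mirror the common textbook income example (range [12000, 98000], mean 54000, standard deviation 16000).

```python
import math

def min_max(v, vmin, vmax, new_min=0.0, new_max=1.0):
    # Min-max: linearly rescale v into [new_min, new_max]
    return (v - vmin) / (vmax - vmin) * (new_max - new_min) + new_min

def z_score(v, mean, std):
    # Z-score: center on the mean, scale by the standard deviation
    return (v - mean) / std

def decimal_scaling(v, max_abs):
    # Decimal scaling: divide by 10**j so all values fall within (-1, 1)
    j = math.ceil(math.log10(max_abs))
    return v / (10 ** j)

print(min_max(73600, 12000, 98000))      # ~0.716
print(z_score(73600, 54000, 16000))      # 1.225
print(decimal_scaling(73600, 98000))     # 0.736
```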
16. Data reduction is applied to obtain a reduced representation of the data set
Data cube aggregation
Attribute subset selection reduces the data size by removing irrelevant or redundant
attributes.
Dimensionality reduction involves data encoding or transformation to obtain a
compressed representation of the data. Lossy dimensionality reduction – wavelet transforms,
principal component analysis (see the PCA sketch after this list)
Numerosity reduction
Parametric methods use a model to estimate the data, e.g., log-linear models
Nonparametric methods include histograms, clustering, and sampling for storing reduced
representations
Discretization and concept hierarchy generation reduce the number of values for a given
attribute by dividing the range of the attribute into intervals.
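A sketch of lossy dimensionality reduction using principal component analysis via scikit-learn (assumed available); the data is random, purely for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))        # 100 samples, 10 attributes

pca = PCA(n_components=3)             # keep 3 principal components
X_reduced = pca.fit_transform(X)      # compressed (lossy) representation

print(X_reduced.shape)                        # (100, 3)
print(pca.explained_variance_ratio_.sum())    # fraction of variance retained
```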
17. DM tasks are divided into two categories: descriptive and predictive
Descriptive mining tasks characterize general properties of the data
Predictive mining tasks perform inference on current data in order to make predictions
18. Data characterization is a summarization of the general characteristics or
features of a target class of data.
Data corresponding to a user-specified class are typically collected by a database query
Example: to study the characteristics of software products whose sales increased by 10%,
data related to such products is collected
The data cube OLAP roll-up operation is used for data summarization
Output is presented in the form of pie charts, histograms, etc.
Data discrimination is a comparison of the general features of target-class data
objects with the general features of objects from one or a set of contrasting classes.
Example: comparison of a product whose sales increased by 10% with that of a product
whose sales decreased by 30%
19. Classification is the process of finding a model that describes or distinguishes data
classes or concepts, for the purpose of using the model to predict the
class of objects whose class label is unknown.
Classifying loans as ‘safe’ or ‘risky’
Given a customer profile, predicting whether the customer will buy a new computer
Decision tree induction
Bayesian classification
Rule-based classification
Classification by backpropagation
Support vector machines
Classification by association rule analysis
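A sketch of one of these methods, decision tree induction, with scikit-learn; the customer-profile features and labels are hypothetical stand-ins for the ‘buys a computer’ example.

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical customer profiles: [age, income in $1000s]; label 1 = buys a computer
X = [[25, 40], [35, 60], [45, 80], [20, 20], [50, 90], [30, 30]]
y = [1, 1, 1, 0, 1, 0]

clf = DecisionTreeClassifier(max_depth=2).fit(X, y)

# Predict the class of an object whose class label is unknown
print(clf.predict([[28, 55]]))    # e.g. [1] -> likely to buy
```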
20. Prediction models continuous-valued functions. Numeric prediction is the task of
predicting continuous values for a given input.
Regression analysis is a statistical methodology that is often used for numerical
prediction
Linear/straight-line regression involves a response variable, y, and a single
predictor variable, x. It models y as a linear function of x [y = b + wx]
Multiple linear regression extends straight-line regression to involve more than
one predictor variable
Nonlinear regression can model data with polynomial terms
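A minimal sketch of straight-line regression y = b + wx using numpy's least-squares polynomial fit; the data points are made up.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)   # predictor variable
y = np.array([2.1, 4.0, 6.2, 8.1, 9.9])      # response variable

# Fit y = b + w*x by least squares (degree-1 polynomial; returns [w, b])
w, b = np.polyfit(x, y, 1)
print(f"y = {b:.2f} + {w:.2f}x")

# Numeric prediction for a new input
print(b + w * 6.0)
```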
21. The process of grouping a set of physical or abstract objects into classes of similar
objects is called clustering.
A cluster is a collection of data objects that are similar to one another within the
same cluster and dissimilar to the objects in other clusters.
Class labels are not present in training data because they are not known to begin
with. Clustering is used to generate such labels
Applications: taxonomy (organization of observations into hierarchy of classes that
group similar events together)
22. Partitioning method
A partitioning method creates k partitions of a database of n objects or data tuples
Requirements
Each group must contain at least one object
Each object must belong to exactly one group
Objects in the same cluster are close or related to each other, whereas objects of different
clusters are far apart or very different
The k-means algorithm, where each cluster is represented by the mean value of its objects
(sketched after this list)
The k-medoids algorithm, where each cluster is represented by one of the objects located near
the center of the cluster
Works well for small to medium-sized databases
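A sketch of the k-means algorithm via scikit-learn (assumed available); the 2-D points are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two visually separable groups of 2-D points
X = np.array([[1, 1], [1.5, 2], [2, 1.5],
              [8, 8], [8.5, 9], [9, 8.5]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(km.labels_)           # cluster membership of each object
print(km.cluster_centers_)  # each cluster represented by the mean of its objects
```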
23. Hierarchical method
Creates a hierarchical decomposition of the given set of data objects.
Methods are classified by how the hierarchical decomposition is formed
The agglomerative/bottom-up approach merges objects or groups that are close to one another,
until all the groups are merged into one (see the sketch after this list)
The divisive/top-down approach starts with all of the objects in the same cluster and breaks it
down into smaller clusters until eventually each object is in its own cluster
Density-based method
Can easily determine clusters of arbitrary shape
Used to filter out noise
Grid-based method
Quantizes the object space into a finite number of cells that form a grid structure.
Faster processing
Model-based clustering
Hypothesizes a model for each cluster and finds the best fit of the data to the given
model
Locates clusters by constructing a density function that reflects the spatial distribution of
the data
Automatically determines the number of clusters based on standard statistics
Example: self-organizing maps
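A sketch of the agglomerative (bottom-up) approach with SciPy's hierarchical clustering (assumed available); the points are illustrative.

```python
from scipy.cluster.hierarchy import linkage, fcluster

points = [[1, 1], [1.2, 1.1], [5, 5], [5.1, 4.9], [9, 9]]

# Agglomerative: repeatedly merge the closest groups
Z = linkage(points, method="average")

# Cut the resulting tree to obtain 3 flat clusters
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)    # e.g. [1 1 2 2 3]
```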
24. Clustering high-dimensional data
Examines objects having a large number of features
Subspace clustering methods search for clusters in subspaces of the full feature space
Frequent-pattern-based clustering extracts distinct frequent patterns among subsets of
dimensions that occur frequently
Constraint-based clustering
Performs clustering by incorporating user-specified constraints
A constraint expresses a user's expectations or desired results
Example: spatial clustering in the presence of obstacles, and clustering under user-specified
constraints
25. Outliers are data that do not comply with the general behavior or model of the data
Outliers are discarded by most data mining applications; however, in applications such as
fraud detection, they are worth noting. Example: detecting fraudulent credit card usage by
flagging purchases of extremely large amounts on a given day
Outliers may be detected using a statistical test that assumes a probability model, or using
distance measures, where objects that are a substantial distance from any other
cluster are considered outliers (a sketch follows this slide).
Evolution analysis describes and models regularities or trends for objects whose
behavior changes over time.
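A minimal sketch of the statistical flavor of outlier detection on the credit card example; the purchase amounts are made up, and the 2-standard-deviation threshold is an illustrative choice.

```python
import statistics

amounts = [35, 20, 42, 18, 50, 27, 4100]   # one day's card purchases (made up)

mean = statistics.mean(amounts)
std  = statistics.stdev(amounts)

# Flag values more than 2 standard deviations from the mean as outliers
outliers = [a for a in amounts if abs(a - mean) > 2 * std]
print(outliers)    # -> [4100]
```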
26. Stream data is massive, temporally ordered, fast changing, and potentially infinite.
Stream data flows in and out of a computer system continuously and with varying
update rates.
Examples – real-time surveillance systems, communication networks, Internet
traffic, on-line transactions in financial markets or the retail industry, electric power
grids, industrial production processes, and other dynamic environments.
It is impossible to store an entire data stream. Moreover, stream data tends to be at a
rather low level of abstraction.
27. Mining time-series data
A time-series database consists of sequences of values obtained over repeated measurements
of time.
Time-series databases are popular in stock-market analysis, economic and sales
forecasting, budgetary analysis, utility studies, yield studies, workload projections, and
observation of natural phenomena
Mining sequence patterns
A sequence database consists of sequences of ordered elements or events, recorded with or
without a concrete notion of time. Sequential pattern mining is the discovery of
frequently occurring ordered events or subsequences as patterns.
Applications include customer shopping sequences, web clickstreams, biological sequences,
and sequences of events in science and engineering.
Editor's notes
Steps 1-4 are different forms of data preprocessing
From a data warehouse perspective, data mining can be viewed as an advanced stage of on-line analytical processing (OLAP).