Data preprocessing techniques
See my Paris applied psychology conference paper here
https://www.slideshare.net/jasonrodrigues/paris-conference-on-applied-psychology
or
https://prezi.com/view/KBP8JnekVH9LkLOiKY3w/
2. Agenda
• Introduction
• Why data preprocessing?
• Data Cleaning
• Data Integration and Transformation
• Data Reduction
• Discretization and concept hierarchy generation
• Takeaways
3. Why Data Preprocessing?
Data in the real world is dirty:
• incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
• noisy: containing errors or outliers
• inconsistent: containing discrepancies in codes or names
No quality data, no quality mining results!
• Quality decisions must be based on quality data
• A data warehouse needs consistent integration of quality data
A multi-dimensional measure of data quality:
• A well-accepted multi-dimensional view: accuracy, completeness, consistency, timeliness, believability, value added, interpretability, accessibility
• Broad categories: intrinsic, contextual, representational, and accessibility
4. Data Preprocessing
Major Tasks of Data Preprocessing
Data cleaning
• Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
Data integration
• Integration of multiple databases, data cubes, or files
Data transformation
• Normalization (scaling to a specific range)
• Aggregation
Data reduction
• Obtains a reduced representation in volume that produces the same or similar analytical results
• Data discretization: of particular importance, especially for numerical data
• Data aggregation, dimensionality reduction, data compression, generalization
5. Data Preprocessing
Major Tasks of Data Preprocessing
(Diagram: the knowledge-discovery pipeline: Databases → Data Cleaning → Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation)
6. Data Cleaning
Tasks of Data Cleaning
• Fill in missing values
• Identify outliers and smooth noisy data
• Correct inconsistent data
7. Data Cleaning
Manage Missing Data
• Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective in certain cases
• Fill in the missing value manually: tedious + infeasible?
• Use a global constant to fill in the missing value: e.g., "unknown", a new class?!
• Use the attribute mean to fill in the missing value
• Use the attribute mean for all samples of the same class to fill in the missing value: smarter
• Use the most probable value to fill in the missing value: inference-based, such as regression, the Bayesian formula, or a decision tree
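The mean-based strategies above can be sketched in plain Python; the helper name `impute_missing` and the toy records in the usage example are illustrative, not from the slides:

```python
from statistics import mean

def impute_missing(rows, attr, class_attr=None):
    """Fill missing (None) values of `attr` with the attribute mean.
    If `class_attr` is given, use the mean of samples in the same class
    (the "smarter" variant from the slide)."""
    if class_attr is None:
        groups = {None: rows}
        key = lambda r: None
    else:
        groups = {}
        for r in rows:
            groups.setdefault(r[class_attr], []).append(r)
        key = lambda r: r[class_attr]
    # Mean of the non-missing values within each group
    means = {g: mean(r[attr] for r in rs if r[attr] is not None)
             for g, rs in groups.items()}
    return [dict(r, **{attr: means[key(r)] if r[attr] is None else r[attr]})
            for r in rows]

data = [{"income": 30, "cls": "A"}, {"income": None, "cls": "A"},
        {"income": 50, "cls": "B"}, {"income": 70, "cls": "B"},
        {"income": None, "cls": "B"}]
filled = impute_missing(data, "income", "cls")   # per-class means: A=30, B=60
```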
8. Data Cleaning
Manage Noisy Data
Binning method:
• first sort the data and partition it into (equi-depth) bins
• then smooth by bin means, bin medians, bin boundaries, etc.
Clustering:
• detect and remove outliers
Semi-automated:
• combined computer and manual intervention
Regression:
• use regression functions to smooth the data
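A minimal sketch of equi-depth binning with smoothing by bin means and by bin boundaries. The function names are illustrative, the price list in the usage example is the classic textbook example rather than data from these slides, and for simplicity the sketch assumes the number of values divides evenly into the bins:

```python
def equi_depth_bins(values, n_bins):
    """Sort values and partition into equal-frequency (equi-depth) bins.
    Assumes len(values) is divisible by n_bins."""
    vals = sorted(values)
    size = len(vals) // n_bins
    return [vals[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_means(values, n_bins):
    """Replace each value with the mean of its bin."""
    out = []
    for b in equi_depth_bins(values, n_bins):
        m = sum(b) / len(b)
        out.extend([m] * len(b))
    return out

def smooth_by_boundaries(values, n_bins):
    """Replace each value with the closer of its bin's min or max."""
    out = []
    for b in equi_depth_bins(values, n_bins):
        lo, hi = b[0], b[-1]
        out.extend([lo if v - lo <= hi - v else hi for v in b])
    return out

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
smoothed = smooth_by_means(prices, 3)   # first bin (4, 8, 9, 15) becomes 9, 9, 9, 9
```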
10. Data Cleaning
Regression Analysis
(Plot: data points (x1, y1) fitted by the regression line y = x + 1)
• Linear regression (best line to fit two variables)
• Multiple linear regression (more than two variables, fit to a multidimensional surface)
11. Data Cleaning
Inconsistent Data
• Manual correction using external references
• Semi-automatic correction using various tools
  – to detect violations of known functional dependencies and data constraints
  – to correct redundant data
12. Data Integration and Transformation
Tasks of Data Integration and Transformation
• Data integration:
  – combines data from multiple sources into a coherent store
• Schema integration
  – integrate metadata from different sources
  – entity identification problem: identify real-world entities from multiple data sources, e.g., A.cust-id ≡ B.cust-#
• Detecting and resolving data value conflicts
  – for the same real-world entity, attribute values from different sources differ
  – possible reasons: different representations, different scales (e.g., metric vs. British units), different currencies
13. Manage Data Integration
Data integration and transformation
• Redundant data occur often when integrating multiple databases
  – the same attribute may have different names in different databases
  – one attribute may be a "derived" attribute in another table, e.g., annual revenue
• Redundant attributes may be detected by correlation analysis
• Careful integration can help reduce/avoid redundancies and inconsistencies and improve mining speed and quality
Correlation coefficient: r_{A,B} = Σ (a_i − Ā)(b_i − B̄) / ((n − 1) σ_A σ_B)
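The correlation coefficient above can be computed directly; this is a plain-Python sketch (the function name is illustrative), using sample standard deviations to match the (n − 1) factor in the formula:

```python
def correlation(a, b):
    """Pearson correlation:
    r_{A,B} = sum((a_i - mean_A) * (b_i - mean_B)) / ((n - 1) * sigma_A * sigma_B)
    where sigma is the sample standard deviation."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = (sum((x - ma) ** 2 for x in a) / (n - 1)) ** 0.5
    sb = (sum((y - mb) ** 2 for y in b) / (n - 1)) ** 0.5
    return cov / ((n - 1) * sa * sb)

r = correlation([1, 2, 3], [2, 4, 6])   # perfectly correlated: r = 1
```

A value of r close to +1 or −1 suggests one attribute is redundant given the other.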
14. Manage Data Transformation
Data integration and transformation
• Smoothing: remove noise from data (binning, clustering, regression)
• Aggregation: summarization, data cube construction
• Generalization: concept hierarchy climbing
• Normalization: scale to fall within a small, specified range
  – min-max normalization
  – z-score normalization
  – normalization by decimal scaling
• Attribute/feature construction
  – new attributes constructed from the given ones
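The three normalization schemes can be sketched as follows. Function names are illustrative, and the z-score sketch uses the population standard deviation, which is one of two common conventions:

```python
def min_max(values, new_min=0.0, new_max=1.0):
    """Min-max normalization: linearly scale into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min
            for v in values]

def z_score(values):
    """Z-score normalization: (v - mean) / std."""
    n = len(values)
    m = sum(values) / n
    s = (sum((v - m) ** 2 for v in values) / n) ** 0.5
    return [(v - m) / s for v in values]

def decimal_scaling(values):
    """Decimal scaling: divide by the smallest 10^j making every |v| < 1."""
    j = 0
    while max(abs(v) for v in values) / (10 ** j) >= 1:
        j += 1
    return [v / (10 ** j) for v in values]

scaled = min_max([10, 20, 30])   # -> [0.0, 0.5, 1.0]
```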
15. Manage Data Reduction
Data reduction
Data reduction: obtain a reduced representation while still retaining critical information
• Data cube aggregation
• Dimensionality reduction
• Data compression
• Numerosity reduction
• Discretization and concept hierarchy generation
16. Data Cube Aggregation
Data reduction
• Multiple levels of aggregation in data cubes
  – further reduce the size of the data to deal with
• Reference appropriate levels: use the smallest representation capable of solving the task
17. Data Compression
Data reduction
• String compression
  – there are extensive theories and well-tuned algorithms
  – typically lossless
  – but only limited manipulation is possible without expansion
• Audio/video and image compression
  – typically lossy compression, with progressive refinement
  – sometimes small fragments of the signal can be reconstructed without reconstructing the whole
• Time sequences are not audio
  – typically short, and vary slowly with time
19. Similarities and Dissimilarities
Proximity
• Proximity refers to either similarity or dissimilarity; the proximity between two objects is a function of the proximity between their corresponding attributes.
• Similarity: numeric measure of the degree to which two objects are alike.
• Dissimilarity: numeric measure of the degree to which two objects are different.
20. Similarity and Dissimilarity between Data Objects
• Similarity
  – numerical measure of how alike two data objects are
  – higher when objects are more alike
  – often falls in the range [0, 1]
• Dissimilarity
  – numerical measure of how different two data objects are
  – lower when objects are more alike
  – minimum dissimilarity is often 0
  – upper limit varies
• Proximity refers to a similarity or dissimilarity
21. Euclidean Distance
• Euclidean distance:
  dist(p, q) = sqrt( Σ_{k=1..n} (p_k − q_k)² )
  where n is the number of dimensions (attributes) and p_k and q_k are, respectively, the kth attributes (components) of data objects p and q.
• Standardization is necessary if scales differ.
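The formula above translates directly into code; this is a minimal sketch (function name illustrative):

```python
def euclidean(p, q):
    """dist(p, q) = sqrt(sum over k of (p_k - q_k)^2),
    where p and q are sequences of n attribute values."""
    return sum((pk - qk) ** 2 for pk, qk in zip(p, q)) ** 0.5

d = euclidean((0, 0), (3, 4))   # the 3-4-5 triangle: d = 5.0
```

As the slide notes, attributes on very different scales should be standardized (e.g., z-scored) first, or the largest-scale attribute will dominate the distance.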
23. Minkowski Distance
• r = 1: city-block (Manhattan, taxicab, L1 norm) distance
  – a common example is the Hamming distance, which is just the number of bits that differ between two binary vectors
• r = 2: Euclidean (L2 norm) distance
• r → ∞: "supremum" (Lmax norm, L∞ norm) distance
  – this is the maximum difference between any components of the vectors
  – example: L∞ of (1, 0, 2) and (6, 0, 3) = ??
• Do not confuse r with n: all these distances are defined for all numbers of dimensions.
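The three cases can be sketched with a single parameterized function (name illustrative). For the slide's example, L∞ of (1, 0, 2) and (6, 0, 3) is max(|1−6|, |0−0|, |2−3|) = 5:

```python
def minkowski(p, q, r):
    """L_r distance between attribute vectors p and q:
    r=1 is Manhattan, r=2 is Euclidean, r=float('inf') is supremum."""
    diffs = [abs(pk - qk) for pk, qk in zip(p, q)]
    if r == float("inf"):
        return max(diffs)              # supremum norm: largest component difference
    return sum(d ** r for d in diffs) ** (1 / r)

d_inf = minkowski((1, 0, 2), (6, 0, 3), float("inf"))   # -> 5
```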
25. Euclidean Distance Properties
• Distances, such as the Euclidean distance, have some well-known properties:
1. d(x, y) ≥ 0 for all x and y, and d(x, y) = 0 only if x = y. (Positive definiteness)
2. d(x, y) = d(y, x) for all x and y. (Symmetry)
3. d(x, z) ≤ d(x, y) + d(y, z) for all points x, y, and z. (Triangle inequality)
where d(x, y) is the distance (dissimilarity) between points (data objects) x and y.
• A distance that satisfies these properties is a metric, and a space with such a distance is called a metric space.
26. Non-Metric Dissimilarities – Set Differences
• Non-metric measures are often robust (resistant to outliers, errors in objects, etc.)
  – symmetry and, mainly, the triangle inequality are often violated
• They cannot be directly used with MAMs (metric access methods)
(Diagram: a triangle with sides a, b, c where a > b + c, violating the triangle inequality; and a pair of directed distances with a ≠ b, violating symmetry)
27. Non-Metric Dissimilarities – Time
• Various k-median distances
  – measure the distance between the two (k-th) most similar portions of the objects
• COSIMIR
  – a back-propagation network with a single output neuron serving as a distance; allows training
• Dynamic Time Warping distance
  – a sequence alignment technique
  – minimizes the sum of distances between sequence elements
• Fractional Lp distances
  – a generalization of Minkowski distances (p < 1)
  – more robust to extreme differences in individual coordinates
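The Dynamic Time Warping distance described above can be sketched as the classic O(n·m) dynamic program. The function name is illustrative, and this sketch assumes absolute difference as the element-wise distance:

```python
def dtw(a, b):
    """Dynamic Time Warping: minimal total |a_i - b_j| cost over all
    monotone alignments of sequences a and b."""
    INF = float("inf")
    n, m = len(a), len(b)
    # D[i][j] = best cost aligning a[:i] with b[:j]
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # extend by matching, or by repeating an element of either sequence
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

d = dtw([1, 2, 3], [1, 2, 2, 3])   # warping absorbs the repeated 2: d = 0
```

Unlike the Minkowski distances, DTW tolerates local stretching of the time axis, which is why it suits time sequences.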