9. Handling Missing Values
1. Ignore the tuple
2. Fill in the missing value manually
3. Use a global constant to fill in the missing value
4. Use the attribute mean to fill in the missing value
5. Use the attribute mean for all samples belonging to the same class as the given tuple
6. Use the most probable value to fill in the missing value
10. Ignore the tuple
Discard the records that contain missing data. This is commonly used in classification mining, when a large number of attribute values are missing.
Fill in the missing value manually
Fill the missing values in by hand. This approach is not suitable when the data set is large and many values are missing.
Use a global constant to fill in the missing value
Replace every missing attribute value with one constant, such as "unknown".
11. Use the attribute mean to fill in the missing value
Use the mean of the attribute to fill in the missing values. For example, if the average customer income is known to be 12,000 baht per month, use this value in place of any missing income value.
Use the attribute mean for all samples belonging to the same class as the given tuple
Use the mean of the attribute over the samples of the same class to fill in the missing values. For example, fill in a customer's missing income with the mean income of customers in the same occupation group.
12. Use the most probable value to fill in the missing value
Use the most probable value to fill in the missing data, e.g., a value obtained from a regression equation, from inference with the Bayesian formula, or from a decision tree. For example, build a decision tree from the customer data to predict customer income, then use the prediction in place of the missing value.
This method is widely used, because it predicts the missing value from the values in the current data set and the relationships between its attributes.
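A minimal pandas sketch of options (4) and (5), assuming a hypothetical customer table with occupation and income columns:

```python
import pandas as pd

# Hypothetical customer data; NaN marks missing income values.
df = pd.DataFrame({
    "occupation": ["teacher", "teacher", "engineer", "engineer", "engineer"],
    "income":     [11000,     None,      18000,      None,       20000],
})

# (4) Fill with the overall attribute mean.
df["income_mean"] = df["income"].fillna(df["income"].mean())

# (5) Fill with the mean of samples in the same class (occupation group).
df["income_class_mean"] = df["income"].fillna(
    df.groupby("occupation")["income"].transform("mean")
)

print(df)
```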
15. Binning Methods
Binning smooths the data as follows: sort the data, then use partitioning to split it into segments, each called a bin. The data in each bin are then smoothed locally (local smoothing), using the values of the neighboring points (neighborhood) in the same bin, or bucket, e.g., the bin means, the bin medians, or the bin boundaries.
16. Binning Methods
Example:
Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34
Partition into equi-depth bins (depth = 3):
Bin 1: 4, 8, 15
Bin 2: 21, 21, 24
Bin 3: 25, 28, 34
Smoothing by bin means:
Bin 1: 9, 9, 9
Bin 2: 22, 22, 22
Bin 3: 29, 29, 29
Smoothing by bin boundaries:
Bin 1: 4, 4, 15
Bin 2: 21, 21, 24
Bin 3: 25, 25, 34
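A short Python sketch that reproduces the example above (equi-depth partitioning, then smoothing by bin means and by bin boundaries):

```python
def equi_depth_bins(values, depth):
    """Partition sorted values into bins of equal depth (count)."""
    data = sorted(values)
    return [data[i:i + depth] for i in range(0, len(data), depth)]

def smooth_by_means(bins):
    """Replace every value in a bin with the bin mean."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value with the closer of the bin's min/max boundary."""
    return [[b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b] for b in bins]

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equi_depth_bins(prices, depth=3)
print(bins)                        # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
print(smooth_by_means(bins))       # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]
print(smooth_by_boundaries(bins))  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```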
17. Regression
Smooth by fitting the data into regression functions
Linear regression:
Y = α + βX
Multiple linear regression:
Y = b0 + b1X1 + b2X2 + ... + bmXm
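A minimal sketch of smoothing by a fitted linear regression, using NumPy's least-squares fit on hypothetical (X, Y) observations:

```python
import numpy as np

# Hypothetical noisy observations of Y as a function of X.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Fit Y = alpha + beta * X by least squares; polyfit returns [beta, alpha].
beta, alpha = np.polyfit(x, y, deg=1)

# Smoothed values: replace each y with its value on the fitted line.
y_smooth = alpha + beta * x
print(alpha, beta)
print(y_smooth)
```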
27. Data Integration
Data integration:
combines data from multiple sources into a
coherent store
Schema integration
integrate metadata from different sources
Entity identification problem: identify real world entities from multiple data sources, e.g., A.cust-id ≡ B.cust-#
Detecting and resolving data value conflicts
for the same real world entity, attribute values from
different sources are different
possible reasons: different representations, different
scales, e.g., metric vs. British units
28. Data Integration (Cont.)
Redundant data occur often when integrating multiple databases
The same attribute may have different names
in different databases
One attribute may be a “derived” attribute in
another table, e.g., annual revenue
Redundant data may be able to be detected by
correlation analysis
Careful integration of the data from multiple
sources may help reduce/avoid redundancies
and inconsistencies and improve mining speed
and quality
29. Data Integration : Correlation analysis
The correlation between attributes A and B can be measured by

r_A,B = Σ (A − Ā)(B − B̄) / ((n − 1) σ_A σ_B)

If r_A,B is greater than 0, then A and B are positively correlated, meaning that the values of A increase as the values of B increase
The mean of A is Ā = (1/n) Σ A
The standard deviation of A is σ_A = sqrt( Σ (A − Ā)² / (n − 1) )
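A small Python sketch of this correlation analysis, computed directly from the definitions above on hypothetical attribute values:

```python
import math

def correlation(a, b):
    """Pearson correlation r_A,B as defined above (sample form, n - 1)."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    std_a = math.sqrt(sum((x - mean_a) ** 2 for x in a) / (n - 1))
    std_b = math.sqrt(sum((y - mean_b) ** 2 for y in b) / (n - 1))
    return cov / ((n - 1) * std_a * std_b)

# Hypothetical attribute values: B rises with A, so r should be close to 1.
A = [2, 4, 6, 8, 10]
B = [1, 3, 5, 7, 11]
print(correlation(A, B))
```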
31. 4) Data Reduction (cont.)
Data reduction strategies
Data cube aggregation
Dimensionality reduction — remove unimportant attributes
Data Compression
Numerosity reduction — fit data into models
Discretization and concept hierarchy generation
32. Data Reduction: Data cube aggregation
The data can be aggregated so that the resulting data summarize the data at a higher level.
Ex. The data consist of the AllElectronics sales per quarter for the years 2002 to 2004.
The data are aggregated so that they summarize the total sales per year instead of per quarter, without loss of the information necessary for the analysis task.
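A pandas sketch of this aggregation step, with hypothetical quarterly figures standing in for the AllElectronics data:

```python
import pandas as pd

# Hypothetical quarterly sales, in the spirit of the AllElectronics example.
sales = pd.DataFrame({
    "year":    [2002] * 4 + [2003] * 4 + [2004] * 4,
    "quarter": ["Q1", "Q2", "Q3", "Q4"] * 3,
    "amount":  [224, 408, 350, 586, 301, 456, 396, 602, 330, 480, 420, 630],
})

# Aggregate up the time hierarchy: quarterly figures -> one total per year.
annual = sales.groupby("year", as_index=False)["amount"].sum()
print(annual)
```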
33. Data Reduction: Data cube aggregation
Concept hierarchies may exist for each
attribute, allowing the analysis of data at
multiple levels of abstraction
[Figure: a data cube viewed as a lattice of cuboids]
34. Data Reduction: Dimensionality reduction
Feature selection (i.e., attribute subset
selection):
Select a minimum set of features such that the
probability distribution of different classes
given the values for those features is as close
as possible to the original distribution given the
values of all features
reduces the number of discovered patterns, making them easier to understand
Heuristic methods:
step-wise forward selection
step-wise backward elimination
combining forward selection and backward
elimination
35. Data Reduction: Dimensionality reduction
Step-wise forward selection
Start with an empty set of attributes, called the reduced set
The best of the original attributes is determined and added to the reduced set
At each subsequent iteration, the best of the remaining original attributes is added to the reduced set (a sketch follows the trace below)
Initial attribute set: {A1, A2, A3, A4, A5, A6}
Initial reduced set: {}
=> {A1}
=> {A1, A4}
Reduced attribute set: {A1, A4, A6}
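A minimal Python sketch of the loop above; the score function is a hypothetical placeholder for whatever subset-evaluation measure is used (e.g., the accuracy of a classifier trained on the subset):

```python
def forward_selection(attributes, score, k):
    """Step-wise forward selection: grow the reduced set one best attribute
    at a time, as traced in the slide above."""
    reduced = []                       # initial reduced set: {}
    remaining = list(attributes)
    while len(reduced) < k:
        # Pick the remaining attribute whose addition scores best.
        best = max(remaining, key=lambda a: score(reduced + [a]))
        reduced.append(best)
        remaining.remove(best)
    return reduced

# Toy score that happens to prefer A1, A4, A6 (illustration only).
weights = {"A1": 3.0, "A2": 0.1, "A3": 0.2, "A4": 2.0, "A5": 0.3, "A6": 1.0}
toy_score = lambda subset: sum(weights[a] for a in subset)
print(forward_selection(["A1", "A2", "A3", "A4", "A5", "A6"], toy_score, k=3))
# -> ['A1', 'A4', 'A6']
```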
36. Data Reduction: Dimensionality reduction
Step-wise backward elimination
Start with the full set of attributes
At each step, removes the worst attribute
remaining in the set
Initial attribute set: {A1, A2, A3, A4, A5, A6}
=> {A1, A3, A4, A5, A6}
=> {A1, A4, A5, A6}
Reduced attribute set: {A1, A4, A6}
37. Data Reduction: Dimensionality reduction
Combining forward selection and backward
elimination
At each step, selects the best attribute
and removes the worst from among the
remaining attributes
38. Data Reduction: Dimensionality reduction
Decision-tree induction
Initial attribute set:
{A1, A2, A3, A4, A5, A6}
[Figure: induced decision tree with root A4, internal nodes A1 and A6, and leaves labeled Class 1 / Class 2]
=> Reduced attribute set: {A1, A4, A6}
39. Data Reduction: Data Compression
String compression
There are extensive theories and well-tuned algorithms
Typically lossless
But only limited manipulation is possible without expansion
Audio/video compression
Typically lossy compression, with progressive refinement
Sometimes small fragments of signal can be reconstructed
without reconstructing the whole
Time sequences (not audio)
Typically short and vary slowly with time
40. Data Reduction: Numerosity reduction
Reduce data volume by choosing alternative, smaller forms of data representation
Type of Numerosity reduction:
Parametric methods
Assume the data fits some model, estimate model parameters, store only the
parameters, and discard the data (except possible outliers)
Example: Regression
Non-parametric methods
Do not assume models
Major families: histograms, clustering, sampling
41. Data Reduction: Numerosity reduction
Histograms
A popular data reduction technique
Divide data into buckets and store average (sum) for each bucket
Can be constructed optimally in one dimension using dynamic programming
Related to quantization problems.
[Figure: equal-width histogram of price, with buckets from 10,000 to 100,000 and counts from 0 to 40]
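A NumPy sketch of histogram-based reduction on hypothetical price data: only the bucket boundaries and counts are kept, not the raw values:

```python
import numpy as np

# Hypothetical price data; reduce it to 10 equal-width buckets by storing
# only each bucket's range and count instead of the 1,000 raw values.
rng = np.random.default_rng(0)
prices = rng.integers(10_000, 100_000, size=1_000)

counts, edges = np.histogram(prices, bins=10, range=(10_000, 100_000))
for lo, hi, c in zip(edges[:-1], edges[1:], counts):
    print(f"[{lo:>8.0f}, {hi:>8.0f}): {c}")
```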
42. Data Reduction: Numerosity reduction
Clustering
Partition data set into clusters, and one
can store cluster representation only
Can be very effective if data is clustered
but not if data is “smeared”
Can have hierarchical clustering and be
stored in multi-dimensional index tree
structures
There are many choices of clustering
definitions and clustering algorithms
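A sketch of clustering-based reduction using scikit-learn's KMeans on hypothetical 2-D data; only the centroids and cluster sizes are stored as the cluster representation:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical 2-D data with three natural groups.
rng = np.random.default_rng(1)
data = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2))
                  for c in ((0, 0), (5, 5), (0, 5))])

# Store only the cluster representations (centroids + sizes), not the points.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(data)
centroids = km.cluster_centers_
sizes = np.bincount(km.labels_)
print(centroids, sizes)
```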
43. Data Reduction: Numerosity reduction
Sampling
obtaining a small sample s to represent the whole data set D, which contains N tuples
Simple Random Sample Without Replacement
(SRSWOR)
The probability of drawing any tuple in D is 1/N
Simple Random Sample With Replacement (SRSWR)
Cluster /Stratified sampling
Approximate the percentage of each class (or
subpopulation of interest) in the overall database
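A Python sketch of the three sampling schemes on a hypothetical data set (the even/odd strata are illustrative only):

```python
import random
from collections import defaultdict

D = list(range(1, 101))          # hypothetical data set of N = 100 tuples

# SRSWOR: each tuple is drawn at most once, with probability 1/N per draw.
srswor = random.sample(D, k=10)

# SRSWR: tuples may be drawn more than once.
srswr = random.choices(D, k=10)

# Stratified sampling: keep each class's share of the sample close to its
# share of the data (class labels here are a hypothetical even/odd split).
strata = defaultdict(list)
for t in D:
    strata["even" if t % 2 == 0 else "odd"].append(t)
stratified = [t for group in strata.values()
              for t in random.sample(group, k=len(group) // 10)]
print(srswor, srswr, stratified)
```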
47. Data Reduction: Numerosity reduction
Hierarchical Reduction
reduce the data by collecting and replacing low
level concepts (such as numeric values for the
attribute age) by higher level concepts (such as
young, middle-aged, or senior)
Ex. Suppose that an index tree contains 10,000 tuples with keys ranging from 1 to 9999, and that its root holds pointers to the data keys 986, 3396, 5411, 8392, and 9544, partitioning the key range into six buckets. Each bucket contains roughly 10,000/6 items.
The use of multidimensional index trees as a form of data
reduction relies on an ordering of the attribute values in each
dimension.
48. Data Reduction: Discretization
Three types of attributes:
Nominal — values from an unordered set
Ordinal — values from an ordered set
Continuous — real numbers
Discretization:
divide the range of a continuous attribute into
intervals
Some classification algorithms only accept
categorical attributes.
Reduce data size by discretization
Prepare for further analysis
49. Data Reduction: Discretization
Typical methods: All the methods can be
applied recursively
Binning
Histogram analysis
Clustering analysis
Entropy-based discretization
Segmentation by natural partitioning
50. Data Reduction: Discretization
Entropy-based discretization
Entropy: H(X) = − Σ_{x ∈ A_X} P(x) log2 P(x)
Conditional entropy: H(X|Y) = Σ_{y ∈ A_Y} P(y) H(X|y)
Example: Coin Flip
A_X = {heads, tails}
P(heads) = P(tails) = ½
Each outcome contributes −½ log2(½) = ½, so
H(X) = ½ + ½ = 1
What about a two-headed coin? (P(heads) = 1, so H(X) = 0)
51. Data Reduction: Discretization
Given a set of samples S, if S is partitioned into two intervals S1 and S2 using boundary T, the entropy after partitioning is

H(S, T) = (|S1| / |S|) H(S1) + (|S2| / |S|) H(S2)

The boundary that minimizes the entropy function over all possible boundaries is selected as a binary discretization.
The process is recursively applied to partitions obtained until some stopping criterion is met, e.g.,

H(S) − H(S, T) < δ

Experiments show that it may reduce data size and improve classification accuracy
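A Python sketch of one step of entropy-based discretization: it scans all candidate boundaries T and returns the one minimizing H(S, T), on hypothetical labeled values:

```python
import math
from collections import Counter

def entropy(labels):
    """H(S) = -sum p log2 p over the class distribution of S."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_split(values, labels):
    """Try every boundary T between sorted values; return the T minimizing
    H(S, T) = |S1|/|S| * H(S1) + |S2|/|S| * H(S2)."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_t, best_h = None, float("inf")
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                       # no boundary between equal values
        t = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [l for _, l in pairs[:i]]
        right = [l for _, l in pairs[i:]]
        h = len(left) / n * entropy(left) + len(right) / n * entropy(right)
        if h < best_h:
            best_t, best_h = t, h
    return best_t, best_h

# Hypothetical attribute values with class labels; the classes separate at ~30.
vals = [10, 15, 20, 25, 35, 40, 45, 50]
labs = ["a", "a", "a", "a", "b", "b", "b", "b"]
print(best_split(vals, labs))   # boundary 30.0, entropy 0.0
```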
52. Data Reduction: Discretization
Segmentation by natural partitioning
A simple 3-4-5 rule can be used to segment numeric data into relatively uniform, "natural" intervals (a sketch follows the rules below):
If an interval covers 3, 6, or 9 distinct values at the most significant digit, partition it into 3 equi-width intervals
If it covers 7 distinct values, partition it into 3 intervals, grouped 2-3-2
If it covers 2, 4, or 8 distinct values, partition it into 4 equi-width intervals
If it covers 1, 5, or 10 distinct values, partition it into 5 equi-width intervals
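A rough sketch of the rule's core step, choosing the interval count from the distinct values at the most significant digit; the full 3-4-5 rule also handles outliers and the 2-3-2 split, which this sketch only notes in a comment:

```python
import math

def natural_partition(low, high):
    """Choose the number of equi-width intervals from the count of
    distinct values at the most significant digit of the range."""
    width = high - low
    msd = 10 ** math.floor(math.log10(width))   # most significant digit unit
    distinct = round(width / msd)               # distinct values at that digit
    if distinct in (3, 6, 9):
        n = 3
    elif distinct == 7:
        n = 3   # the full rule groups these as 2-3-2 rather than equi-width
    elif distinct in (2, 4, 8):
        n = 4
    else:       # 1, 5, or 10 distinct values
        n = 5
    step = width / n
    return [(low + i * step, low + (i + 1) * step) for i in range(n)]

print(natural_partition(0, 9000))    # 9 distinct -> 3 intervals of width 3000
print(natural_partition(0, 5000))    # 5 distinct -> 5 intervals of width 1000
```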
54. Data Reduction: Concept Hierarchy
Specification of a partial ordering of attributes explicitly at the schema level by users or experts
street < city < state < country
Specification of a portion of a hierarchy by explicit data grouping
{Urbana, Champaign, Chicago} < Illinois
Specification of a set of attributes:
System automatically generates the partial ordering by analysis of the number of distinct values
E.g., street < city < state < country
Specification of only a partial set of attributes
E.g., only street < city, not others
55. Data Reduction: Concept Hierarchy
Automatic Concept Hierarchy Generation
Some concept hierarchies can be automatically generated based on
the analysis of the number of distinct values per attribute in the
given data set
The attribute with the most distinct values is placed at the lowest
level of the hierarchy
Note: Exception—weekday, month, quarter, year
56. Data Reduction: Concept Hierarchy
Automatic Concept Hierarchy Generation
country (15 distinct values)
province_or_state (65 distinct values)
city (3,567 distinct values)
street (674,339 distinct values)
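A tiny sketch of this generation step: sort the attributes by distinct-value count (using the counts from the slide), placing the attribute with the most distinct values at the lowest level:

```python
# Distinct-value counts per attribute, taken from the slide above.
distinct_counts = {
    "country": 15,
    "province_or_state": 65,
    "city": 3_567,
    "street": 674_339,
}

# Fewest distinct values -> top of the hierarchy; most -> lowest level.
hierarchy = sorted(distinct_counts, key=distinct_counts.get)
print(" < ".join(reversed(hierarchy)))
# -> street < city < province_or_state < country
```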
58. HW#3
1. What is data preprocessing?
2. Why preprocess the data?
3. What are the major tasks in data preprocessing?
4. What is the data cleaning task?
5. How do we handle missing data?
6. What is the normalization method?
59. HW#3
7. The attribute income has a minimum of $50,000 and a maximum of $150,000. An income value of $100,000 is to be mapped to the new range [3, 5] by min-max normalization. Calculate the transformed income value.
8. The attribute income has a mean of $76,000 and a standard deviation of $12,500. An income value of $95,000 is to be transformed by z-score normalization. Calculate the transformed income value.
9. The attribute A ranges from -650 to 999 and is normalized by decimal scaling. What value of j is needed, and what is the normalized value of -650?
10. What is Task in Data Integration?
11. What is Data reduction strategy?