Big Data TechCon: How to Compute Column Dependencies on a Data Stream Using MapReduce

2. How to Compute Column Dependencies on a Data Stream Using MapReduce
Hans-Henning Gabriel
© 2013 Datameer, Inc. All rights reserved.
Friday, October 18, 2013
5. From Entropy To Mutual Information
A  B  C
x  a  just
x  b  some
y  a  random
x  a  text
z  b  in
z  b  this
y  a  column
Relationship Between A and B?
A == z ➔ B == b
B == b ➔ A == ?
C ➔ A?
How strongly do A, B, and C determine each other?
6. From Entropy To Mutual Information
Entropy: how mixed up are the values?
H(X) = Σ_x p(x) · log(1 / p(x))
• H(X) ≥ 0
• maximum entropy is log |X|
• the more uniformly X is distributed, the higher the entropy
(Figure: example distributions with H(Y) = 0.54, H(Y) = 1, H(Z) = 1.41.)
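The entropy of a column follows directly from its empirical value frequencies. A minimal sketch (my own illustration, not code from the deck), using the A and B columns from the previous slide:

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (base 2) of the empirical distribution of `values`."""
    counts = Counter(values)
    total = sum(counts.values())
    return sum((c / total) * math.log2(total / c) for c in counts.values())

A = ["x", "x", "y", "x", "z", "z", "y"]
B = ["a", "b", "a", "a", "b", "b", "a"]
print(entropy(A))  # ≈ 1.557, the H(A) = 1.56 shown two slides later
print(entropy(B))  # ≈ 0.985
```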
7. From Entropy To Mutual Information
Example columns: A = (x, x, y, x, z, z, y), B = (a, b, a, a, b, b, a)
Joint Entropy:
H(X, Y) = Σ_x Σ_y p(x, y) · log(1 / p(x, y))

Joint distribution of A and B, with marginals:
        x     y     z    | p(B)
a      2/7   2/7    0    | 4/7
b      1/7    0    2/7   | 3/7
p(A)   3/7   2/7   2/7   |

H(A) = 1.56, H(B) = 0.985, H(A, B) = 1.95
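A quick self-contained check of these numbers (sketch, not from the deck):

```python
import math
from collections import Counter

A = ["x", "x", "y", "x", "z", "z", "y"]
B = ["a", "b", "a", "a", "b", "b", "a"]

def entropy_of_counts(counts):
    total = sum(counts)
    return sum((c / total) * math.log2(total / c) for c in counts if c > 0)

joint = Counter(zip(A, B))                       # counts of (A, B) pairs
print(entropy_of_counts(joint.values()))         # H(A, B) ≈ 1.950
print(entropy_of_counts(Counter(A).values()))    # H(A)    ≈ 1.557
print(entropy_of_counts(Counter(B).values()))    # H(B)    ≈ 0.985
```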
8. From Entropy To Mutual Information
Example columns A and B as before.
Conditional Entropy:
how much uncertainty remains about X when
we know the value of Y?
H(Y|X) = Σ_x p(x) · H(Y|X = x)
Conditional distribution of A given each value of B, with its entropy:
         x     y     z   | entropy
B = a   2/4   2/4    0   | 1.0
B = b   1/3    0    2/3  | 0.918
• compute entropies on the conditional distributions
• compute the weighted average
H(A|B) = (4/7) · H(A|B = a) + (3/7) · H(A|B = b) = 0.965
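The same weighted-average computation as a small sketch (my own illustration; base-2 logarithms):

```python
import math
from collections import Counter, defaultdict

A = ["x", "x", "y", "x", "z", "z", "y"]
B = ["a", "b", "a", "a", "b", "b", "a"]

def conditional_entropy(target, given):
    """H(target | given) = Σ_v p(given = v) · H(target | given = v)."""
    groups = defaultdict(list)
    for t, g in zip(target, given):
        groups[g].append(t)
    n = len(target)
    h = 0.0
    for values in groups.values():
        counts = Counter(values).values()
        total = len(values)
        h_cond = sum((c / total) * math.log2(total / c) for c in counts)
        h += (total / n) * h_cond
    return h

print(conditional_entropy(A, B))  # ≈ 0.965, matching H(A|B) above
```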
9. From Entropy To Mutual Information
Example columns A and B as before.
Mutual Information:
the reduction in uncertainty about X due to knowledge of Y
I(X; Y) = H(Y) − H(Y|X) = H(X) − H(X|Y)
        = Σ_x Σ_y p(x, y) · log( p(x, y) / (p(x) · p(y)) )
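For the running example this gives I(A; B) = H(A) − H(A|B) = 1.56 − 0.965 ≈ 0.59 bits. A compact sketch using the double-sum form (my own illustration):

```python
import math
from collections import Counter

A = ["x", "x", "y", "x", "z", "z", "y"]
B = ["a", "b", "a", "a", "b", "b", "a"]

def mutual_information(xs, ys):
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * math.log2((c * n) / (px[x] * py[y]))
               for (x, y), c in pxy.items())

print(mutual_information(A, B))  # ≈ 0.592
```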
10. Further Conditions
• data arrives as a stream
• data is big
• as little user interaction as possible
12. Outline
Partition Incremental Discretization (PiD)
• original
• adjusted
• as MapReduce
2-D histograms on a data stream
• how to create
• handle discrete data
• mutual information
QA
14. PiD - 2 layer approach
(Figure: layer-1 histogram of values, defined by break points and per-bin counts (Frequency vs. Values, step = 1); it illustrates the two update operations: Split of an over-full bin, governed by the threshold alpha, and Border Extension when a value falls outside the current range.)
15. PiD - dropping parameters
splitting threshold alpha:
split when (count + 1) / (total + 2) > α
what is a good value?

parameter step:
• maintain min and max values
• extend border breaks based on min and max
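A minimal layer-1 sketch of these rules (my own illustration, not the Datameer implementation; the initial binning and the split-in-half choice are assumptions):

```python
import bisect

class PiDLayer1:
    """Layer-1 incremental histogram: `breaks` define the bins, `counts` holds
    one count per bin. Over-full bins are split in half; values outside the
    current range extend the border."""

    def __init__(self, lo, hi, n_bins=10, alpha=0.05):
        width = (hi - lo) / n_bins
        self.breaks = [lo + i * width for i in range(n_bins + 1)]
        self.counts = [0.0] * n_bins
        self.alpha = alpha
        self.total = 0

    def add(self, x):
        # Border extension: add outer bins until x falls inside the range.
        while x < self.breaks[0]:
            width = self.breaks[1] - self.breaks[0]
            self.breaks.insert(0, self.breaks[0] - width)
            self.counts.insert(0, 0.0)
        while x >= self.breaks[-1]:
            width = self.breaks[-1] - self.breaks[-2]
            self.breaks.append(self.breaks[-1] + width)
            self.counts.append(0.0)

        i = bisect.bisect_right(self.breaks, x) - 1
        self.counts[i] += 1
        self.total += 1

        # Split rule with the smoothed threshold from the slide above.
        if (self.counts[i] + 1) / (self.total + 2) > self.alpha:
            mid = (self.breaks[i] + self.breaks[i + 1]) / 2
            half = self.counts[i] / 2
            self.breaks.insert(i + 1, mid)
            self.counts[i] = half
            self.counts.insert(i + 1, half)
```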
16. PiD - number of bins
split when: (count + 1) / (total + 2) > α
(Figure: number of bins vs. number of records (0 to 1000) for alpha = 0.01, 0.02, 0.04, 0.08, 0.16, 0.32; the bin count ranges up to about 300.)
18. PiD - MapReduce
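The slide itself is a diagram whose text did not survive extraction. As an illustration only (my assumed decomposition, not necessarily the one shown): each mapper builds a partial layer-1 histogram over its input split and emits its breaks and counts; the reducer merges the partials and derives the layer-2 histogram. A sketch of such a merge, re-binning every partial onto the union of break points under a uniform-within-bin assumption:

```python
def reduce_histograms(partials):
    """Merge partial layer-1 histograms, each given as (breaks, counts).

    Counts of each partial bin are spread over the overlapping bins of the
    union grid in proportion to the overlap width."""
    union = sorted({b for breaks, _ in partials for b in breaks})
    merged = [0.0] * (len(union) - 1)
    for breaks, counts in partials:
        for left, right, c in zip(breaks, breaks[1:], counts):
            if c == 0 or right <= left:
                continue
            for k, (ul, ur) in enumerate(zip(union, union[1:])):
                overlap = min(right, ur) - max(left, ul)
                if overlap > 0:
                    merged[k] += c * overlap / (right - left)
    return union, merged

# two partial histograms from two hypothetical mappers
p1 = ([0, 1, 2], [3.0, 5.0])
p2 = ([0, 2, 4], [4.0, 2.0])
print(reduce_histograms([p1, p2]))  # ([0, 1, 2, 4], [5.0, 7.0, 2.0])
```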
19. PiD - Evaluation
Percentage Error (of histogram P against reference S over k bins):
Σ_{i=1..k} |P_i − S_i| / Σ_{i=1..k} S_i

Affinity Coefficient:
δ(P, S) = Σ_{i=1..k} √(P_i · S_i)
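A small sketch of both metrics (the square root in the affinity coefficient is my reading of the formula; with it, identical normalized histograms give a value of 1):

```python
import math

def percentage_error(P, S):
    """Σ|P_i − S_i| / ΣS_i for two histograms over the same bins."""
    return sum(abs(p - s) for p, s in zip(P, S)) / sum(S)

def affinity(P, S):
    """Σ √(P_i · S_i); equals 1 when P and S are identical distributions."""
    return sum(math.sqrt(p * s) for p, s in zip(P, S))

# toy example with normalized bin frequencies
P = [0.24, 0.26, 0.25, 0.25]
S = [0.25, 0.25, 0.25, 0.25]
print(percentage_error(P, S))  # 0.02
print(affinity(P, S))          # ≈ 0.9999
```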
20. PiD - Evaluation
(Figure: layer-2 histograms for Uniform, Normal, Log Normal, and varying distributions, comparing the original data against PiD and aPiD. Reported per panel as percentage error and affinity coefficient (PiD / aPiD): 0.0010695 / 0.0044543 with 0.9999998 / 0.9999959; 0.0934349 / 0.0369968 with 0.9869035 / 0.9956227; 0.0153203 / 0.0197731 with 0.9993737 / 0.9958205.)
23. Building a Quadtree
(Figure: example quadtree over 2-D points, with a count stored in each cell.)
• how to choose bin width?
• how to merge?
• equal frequencies or equal width?
24. Distributed Merge
• start with a unit-square
• extend by doubling; split by halving ➔ logarithmic number of splits/extensions
• merge by aligning unit-squares
(Figure: two quadtree histograms are merged by aligning their unit-squares and summing the per-cell counts.)
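A rough sketch of the alignment idea (my own simplification: counts are kept per unit-square and a coarser quadtree level is obtained by summing 2x2 blocks; this is not the deck's exact data structure):

```python
import math
from collections import defaultdict

class UnitGrid2D:
    """2-D counts keyed by the unit-square containing each point."""

    def __init__(self, unit=1.0):
        self.unit = unit
        self.counts = defaultdict(float)

    def add(self, x, y, weight=1.0):
        cell = (math.floor(x / self.unit), math.floor(y / self.unit))
        self.counts[cell] += weight

    def merge(self, other):
        # Both grids are aligned to the same unit-squares, so merging two
        # partial histograms is a per-cell sum.
        assert self.unit == other.unit
        for cell, c in other.counts.items():
            self.counts[cell] += c
        return self

    def coarsen(self):
        """One quadtree level up: sum each 2x2 block of unit-squares."""
        parent = UnitGrid2D(unit=self.unit * 2)
        for (i, j), c in self.counts.items():
            parent.counts[(i // 2, j // 2)] += c
        return parent
```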
25. Deriving the Layer 2 Histogram
(Figure: the merged layer-1 cell counts are re-binned into a layer-2 histogram, either Equal Width or Equal Frequency; the counts sum to 34, so the equal-frequency variant targets 4.25 per bin across 8 bins.)
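A sketch of the equal-frequency step (my illustration; it assumes counts are spread uniformly inside each layer-1 cell, so layer-2 breaks may cut through a cell):

```python
def equal_frequency_bins(breaks, counts, n_bins):
    """Derive layer-2 break points from layer-1 breaks/counts so that each
    layer-2 bin receives roughly total / n_bins of the counts."""
    total = sum(counts)
    target = total / n_bins
    new_breaks = [breaks[0]]
    need = target  # counts still needed to close the current layer-2 bin
    for left, right, c in zip(breaks, breaks[1:], counts):
        width = right - left
        pos = left
        remaining = c
        while remaining >= need and len(new_breaks) < n_bins:
            pos += (need / c) * width   # uniform-within-cell assumption
            new_breaks.append(pos)
            remaining -= need
            need = target
        need -= remaining
    new_breaks.append(breaks[-1])
    return new_breaks

breaks = [0, 1, 2, 3, 4]
counts = [2, 8, 4, 2]                           # total 16, so 4 per layer-2 bin
print(equal_frequency_bins(breaks, counts, 4))  # [0, 1.25, 1.75, 2.5, 4]
```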
26. How to deal with discrete data
PiD and Map per bin
(Table: a discrete column is discretized together with a numeric one; PiD bins the numeric values, and each bin keeps a map of discrete-value counts, e.g. {a:3, e:1} and {e:2, g:2, h:1}; the derived layer-2 bins carry maps such as {a:1, b:1}, {e:2}, {a:0.5, b:0.5}, {a:2, b:0.5, e:0.5}.)
Layer 2: number of bins = |vocabulary|
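A small sketch of that idea (my illustration: the numeric value selects a bin, and a Counter per bin tracks the co-occurring discrete values; fixed-width bins stand in for the PiD bins):

```python
import math
from collections import Counter, defaultdict

class BinnedDiscreteCounts:
    """2-D stream histogram for a (numeric, discrete) column pair."""

    def __init__(self, bin_width=1.0):
        self.bin_width = bin_width
        self.bins = defaultdict(Counter)

    def add(self, numeric_value, discrete_value):
        b = math.floor(numeric_value / self.bin_width)
        self.bins[b][discrete_value] += 1

h = BinnedDiscreteCounts(bin_width=1.0)
for num, disc in [(2.3, "e"), (3.6, "g"), (4.1, "e"), (2.9, "a")]:
    h.add(num, disc)
print(dict(h.bins))
# {2: Counter({'e': 1, 'a': 1}), 3: Counter({'g': 1}), 4: Counter({'e': 1})}
```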
28. Mutual Information
(Figure: six example plots, captioned with the estimated mutual information: 0.396 (0.919), 0.023 (0.026), 0.171 (0.131), 0.102 (0.022), 0.013 (0.03), 0.35 (0.544).)
29. Normalization
I(X; Y) / √(H(X) · H(Y))
• penalize variables with large cardinality
• scale the value between 0 and 1
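A short sketch (assuming the normalization is I(X; Y) / √(H(X) · H(Y)), which penalizes high-cardinality columns and stays between 0 and 1):

```python
import math
from collections import Counter

def entropy(values):
    counts = Counter(values)
    n = len(values)
    return sum((c / n) * math.log2(n / c) for c in counts.values())

def normalized_mi(xs, ys):
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    mi = sum((c / n) * math.log2((c * n) / (px[x] * py[y]))
             for (x, y), c in pxy.items())
    denom = math.sqrt(entropy(xs) * entropy(ys))
    return mi / denom if denom > 0 else 0.0

A = ["x", "x", "y", "x", "z", "z", "y"]
B = ["a", "b", "a", "a", "b", "b", "a"]
print(normalized_mi(A, B))  # ≈ 0.478
```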