Big Data TechCon: How to Compute Column Dependencies on a Data Stream Using MapReduce

2. How to Compute Column Dependencies on a Data Stream Using MapReduce
Hans-Henning Gabriel
© 2013 Datameer, Inc. All rights reserved.
Friday, October 18, 2013
5. From Entropy To Mutual Information
A  B  C
x  a  just
x  b  some
y  a  random
x  a  text
z  b  in
z  b  this
y  a  column
Relationship Between A and B?
A == z ➔ B == b
B == b ➔ A == ?
C ➔ A?
How strongly do A, B, and C determine each other?
6. From Entropy To Mutual Information
Entropy: how mixed up are the values?
H(X) = Σ_x p(x) · log(1 / p(x))
• H(X) ≥ 0
• maximum entropy is log |X|
• the more uniformly X is distributed, the higher the entropy
(Figure: example distributions with H(Y) = 0.54, H(Y) = 1, H(Z) = 1.41.)
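The entropy of a column follows directly from its empirical value frequencies. A minimal sketch (my own illustration, not code from the deck), using the A and B columns from the previous slide:

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (base 2) of the empirical distribution of `values`."""
    counts = Counter(values)
    total = sum(counts.values())
    return sum((c / total) * math.log2(total / c) for c in counts.values())

A = ["x", "x", "y", "x", "z", "z", "y"]
B = ["a", "b", "a", "a", "b", "b", "a"]
print(entropy(A))  # ≈ 1.557, the H(A) = 1.56 shown two slides later
print(entropy(B))  # ≈ 0.985
```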
7. From Entropy To Mutual Information
Example columns: A = (x, x, y, x, z, z, y), B = (a, b, a, a, b, b, a)
Joint Entropy:
H(X, Y) = Σ_x Σ_y p(x, y) · log(1 / p(x, y))

Joint distribution of A and B, with marginals:
        x     y     z    | p(B)
a      2/7   2/7    0    | 4/7
b      1/7    0    2/7   | 3/7
p(A)   3/7   2/7   2/7   |

H(A) = 1.56, H(B) = 0.985, H(A, B) = 1.95
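A quick self-contained check of these numbers (sketch, not from the deck):

```python
import math
from collections import Counter

A = ["x", "x", "y", "x", "z", "z", "y"]
B = ["a", "b", "a", "a", "b", "b", "a"]

def entropy_of_counts(counts):
    total = sum(counts)
    return sum((c / total) * math.log2(total / c) for c in counts if c > 0)

joint = Counter(zip(A, B))                       # counts of (A, B) pairs
print(entropy_of_counts(joint.values()))         # H(A, B) ≈ 1.950
print(entropy_of_counts(Counter(A).values()))    # H(A)    ≈ 1.557
print(entropy_of_counts(Counter(B).values()))    # H(B)    ≈ 0.985
```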
8. From Entropy To Mutual Information
Example columns A and B as before.
Conditional Entropy:
how much uncertainty remains about X when
we know the value of Y?
H(Y|X) = Σ_x p(x) · H(Y|X = x)
Conditional distribution of A given each value of B, with its entropy:
         x     y     z   | entropy
B = a   2/4   2/4    0   | 1.0
B = b   1/3    0    2/3  | 0.918
• compute entropies on the conditional distributions
• compute the weighted average
H(A|B) = (4/7) · H(A|B = a) + (3/7) · H(A|B = b) = 0.965
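The same weighted-average computation as a small sketch (my own illustration; base-2 logarithms):

```python
import math
from collections import Counter, defaultdict

A = ["x", "x", "y", "x", "z", "z", "y"]
B = ["a", "b", "a", "a", "b", "b", "a"]

def conditional_entropy(target, given):
    """H(target | given) = Σ_v p(given = v) · H(target | given = v)."""
    groups = defaultdict(list)
    for t, g in zip(target, given):
        groups[g].append(t)
    n = len(target)
    h = 0.0
    for values in groups.values():
        counts = Counter(values).values()
        total = len(values)
        h_cond = sum((c / total) * math.log2(total / c) for c in counts)
        h += (total / n) * h_cond
    return h

print(conditional_entropy(A, B))  # ≈ 0.965, matching H(A|B) above
```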
9. From Entropy To Mutual Information
Example columns A and B as before.
Mutual Information:
the reduction in uncertainty about X due to knowledge of Y
I(X; Y) = H(Y) − H(Y|X) = H(X) − H(X|Y)
        = Σ_x Σ_y p(x, y) · log( p(x, y) / (p(x) · p(y)) )
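For the running example this gives I(A; B) = H(A) − H(A|B) = 1.56 − 0.965 ≈ 0.59 bits. A compact sketch using the double-sum form (my own illustration):

```python
import math
from collections import Counter

A = ["x", "x", "y", "x", "z", "z", "y"]
B = ["a", "b", "a", "a", "b", "b", "a"]

def mutual_information(xs, ys):
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * math.log2((c * n) / (px[x] * py[y]))
               for (x, y), c in pxy.items())

print(mutual_information(A, B))  # ≈ 0.592
```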
10. Further Conditions
• data arrives as a stream
• data is big
• as little user interaction as possible
12. Outline
Partition Incremental Discretization (PiD)
• original
• adjusted
• as MapReduce
2-D histograms on a data stream
• how to create
• handle discrete data
• mutual information
QA
14. PiD - 2 layer approach
(Figure: layer-1 histogram of values, defined by break points and per-bin counts (Frequency vs. Values, step = 1); it illustrates the two update operations: Split of an over-full bin, governed by the threshold alpha, and Border Extension when a value falls outside the current range.)
15. PiD - dropping parameters
splitting threshold alpha:
split when (count + 1) / (total + 2) > α
what is a good value?

parameter step:
• maintain min and max values
• extend border breaks based on min and max
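A minimal layer-1 sketch of these rules (my own illustration, not the Datameer implementation; the initial binning and the split-in-half choice are assumptions):

```python
import bisect

class PiDLayer1:
    """Layer-1 incremental histogram: `breaks` define the bins, `counts` holds
    one count per bin. Over-full bins are split in half; values outside the
    current range extend the border."""

    def __init__(self, lo, hi, n_bins=10, alpha=0.05):
        width = (hi - lo) / n_bins
        self.breaks = [lo + i * width for i in range(n_bins + 1)]
        self.counts = [0.0] * n_bins
        self.alpha = alpha
        self.total = 0

    def add(self, x):
        # Border extension: add outer bins until x falls inside the range.
        while x < self.breaks[0]:
            width = self.breaks[1] - self.breaks[0]
            self.breaks.insert(0, self.breaks[0] - width)
            self.counts.insert(0, 0.0)
        while x >= self.breaks[-1]:
            width = self.breaks[-1] - self.breaks[-2]
            self.breaks.append(self.breaks[-1] + width)
            self.counts.append(0.0)

        i = bisect.bisect_right(self.breaks, x) - 1
        self.counts[i] += 1
        self.total += 1

        # Split rule with the smoothed threshold from the slide above.
        if (self.counts[i] + 1) / (self.total + 2) > self.alpha:
            mid = (self.breaks[i] + self.breaks[i + 1]) / 2
            half = self.counts[i] / 2
            self.breaks.insert(i + 1, mid)
            self.counts[i] = half
            self.counts.insert(i + 1, half)
```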
16. PiD - number of bins
split when: (count + 1) / (total + 2) > α
(Figure: number of bins vs. number of records (0 to 1000) for alpha = 0.01, 0.02, 0.04, 0.08, 0.16, 0.32; the bin count ranges up to about 300.)
18. PiD - MapReduce
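The slide itself is a diagram whose text did not survive extraction. As an illustration only (my assumed decomposition, not necessarily the one shown): each mapper builds a partial layer-1 histogram over its input split and emits its breaks and counts; the reducer merges the partials and derives the layer-2 histogram. A sketch of such a merge, re-binning every partial onto the union of break points under a uniform-within-bin assumption:

```python
def reduce_histograms(partials):
    """Merge partial layer-1 histograms, each given as (breaks, counts).

    Counts of each partial bin are spread over the overlapping bins of the
    union grid in proportion to the overlap width."""
    union = sorted({b for breaks, _ in partials for b in breaks})
    merged = [0.0] * (len(union) - 1)
    for breaks, counts in partials:
        for left, right, c in zip(breaks, breaks[1:], counts):
            if c == 0 or right <= left:
                continue
            for k, (ul, ur) in enumerate(zip(union, union[1:])):
                overlap = min(right, ur) - max(left, ul)
                if overlap > 0:
                    merged[k] += c * overlap / (right - left)
    return union, merged

# two partial histograms from two hypothetical mappers
p1 = ([0, 1, 2], [3.0, 5.0])
p2 = ([0, 2, 4], [4.0, 2.0])
print(reduce_histograms([p1, p2]))  # ([0, 1, 2, 4], [5.0, 7.0, 2.0])
```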
19. PiD - Evaluation
Percentage Error (of histogram P against reference S over k bins):
Σ_{i=1..k} |P_i − S_i| / Σ_{i=1..k} S_i

Affinity Coefficient:
δ(P, S) = Σ_{i=1..k} √(P_i · S_i)
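A small sketch of both metrics (the square root in the affinity coefficient is my reading of the formula; with it, identical normalized histograms give a value of 1):

```python
import math

def percentage_error(P, S):
    """Σ|P_i − S_i| / ΣS_i for two histograms over the same bins."""
    return sum(abs(p - s) for p, s in zip(P, S)) / sum(S)

def affinity(P, S):
    """Σ √(P_i · S_i); equals 1 when P and S are identical distributions."""
    return sum(math.sqrt(p * s) for p, s in zip(P, S))

# toy example with normalized bin frequencies
P = [0.24, 0.26, 0.25, 0.25]
S = [0.25, 0.25, 0.25, 0.25]
print(percentage_error(P, S))  # 0.02
print(affinity(P, S))          # ≈ 0.9999
```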
20. PiD - Evaluation
(Figure: layer-2 histograms for Uniform, Normal, Log Normal, and varying distributions, comparing the original data against PiD and aPiD. Reported per panel as percentage error and affinity coefficient (PiD / aPiD): 0.0010695 / 0.0044543 with 0.9999998 / 0.9999959; 0.0934349 / 0.0369968 with 0.9869035 / 0.9956227; 0.0153203 / 0.0197731 with 0.9993737 / 0.9958205.)
23. Building a Quadtree
(Figure: example quadtree over 2-D points, with a count stored in each cell.)
• how to choose bin width?
• how to merge?
• equal frequencies or equal width?
24. Distributed Merge
• start with a unit-square
• extend by doubling; split by halving ➔ logarithmic number of splits/extensions
• merge by aligning unit-squares
(Figure: two quadtree histograms are merged by aligning their unit-squares and summing the per-cell counts.)
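A rough sketch of the alignment idea (my own simplification: counts are kept per unit-square and a coarser quadtree level is obtained by summing 2x2 blocks; this is not the deck's exact data structure):

```python
import math
from collections import defaultdict

class UnitGrid2D:
    """2-D counts keyed by the unit-square containing each point."""

    def __init__(self, unit=1.0):
        self.unit = unit
        self.counts = defaultdict(float)

    def add(self, x, y, weight=1.0):
        cell = (math.floor(x / self.unit), math.floor(y / self.unit))
        self.counts[cell] += weight

    def merge(self, other):
        # Both grids are aligned to the same unit-squares, so merging two
        # partial histograms is a per-cell sum.
        assert self.unit == other.unit
        for cell, c in other.counts.items():
            self.counts[cell] += c
        return self

    def coarsen(self):
        """One quadtree level up: sum each 2x2 block of unit-squares."""
        parent = UnitGrid2D(unit=self.unit * 2)
        for (i, j), c in self.counts.items():
            parent.counts[(i // 2, j // 2)] += c
        return parent
```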
25. Deriving the Layer 2 Histogram
(Figure: the merged layer-1 cell counts are re-binned into a layer-2 histogram, either Equal Width or Equal Frequency; the counts sum to 34, so the equal-frequency variant targets 4.25 per bin across 8 bins.)
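A sketch of the equal-frequency step (my illustration; it assumes counts are spread uniformly inside each layer-1 cell, so layer-2 breaks may cut through a cell):

```python
def equal_frequency_bins(breaks, counts, n_bins):
    """Derive layer-2 break points from layer-1 breaks/counts so that each
    layer-2 bin receives roughly total / n_bins of the counts."""
    total = sum(counts)
    target = total / n_bins
    new_breaks = [breaks[0]]
    need = target  # counts still needed to close the current layer-2 bin
    for left, right, c in zip(breaks, breaks[1:], counts):
        width = right - left
        pos = left
        remaining = c
        while remaining >= need and len(new_breaks) < n_bins:
            pos += (need / c) * width   # uniform-within-cell assumption
            new_breaks.append(pos)
            remaining -= need
            need = target
        need -= remaining
    new_breaks.append(breaks[-1])
    return new_breaks

breaks = [0, 1, 2, 3, 4]
counts = [2, 8, 4, 2]                           # total 16, so 4 per layer-2 bin
print(equal_frequency_bins(breaks, counts, 4))  # [0, 1.25, 1.75, 2.5, 4]
```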
26. How to deal with discrete data
PiD and Map per bin
(Table: a discrete column is discretized together with a numeric one; PiD bins the numeric values, and each bin keeps a map of discrete-value counts, e.g. {a:3, e:1} and {e:2, g:2, h:1}; the derived layer-2 bins carry maps such as {a:1, b:1}, {e:2}, {a:0.5, b:0.5}, {a:2, b:0.5, e:0.5}.)
Layer 2: number of bins = |vocabulary|
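A small sketch of that idea (my illustration: the numeric value selects a bin, and a Counter per bin tracks the co-occurring discrete values; fixed-width bins stand in for the PiD bins):

```python
import math
from collections import Counter, defaultdict

class BinnedDiscreteCounts:
    """2-D stream histogram for a (numeric, discrete) column pair."""

    def __init__(self, bin_width=1.0):
        self.bin_width = bin_width
        self.bins = defaultdict(Counter)

    def add(self, numeric_value, discrete_value):
        b = math.floor(numeric_value / self.bin_width)
        self.bins[b][discrete_value] += 1

h = BinnedDiscreteCounts(bin_width=1.0)
for num, disc in [(2.3, "e"), (3.6, "g"), (4.1, "e"), (2.9, "a")]:
    h.add(num, disc)
print(dict(h.bins))
# {2: Counter({'e': 1, 'a': 1}), 3: Counter({'g': 1}), 4: Counter({'e': 1})}
```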
28. Mutual Information
(Figure: six example plots, captioned with the estimated mutual information: 0.396 (0.919), 0.023 (0.026), 0.171 (0.131), 0.102 (0.022), 0.013 (0.03), 0.35 (0.544).)
29. Normalization
I(X; Y) / √(H(X) · H(Y))
• penalize variables with large cardinality
• scale the value between 0 and 1
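A short sketch (assuming the normalization is I(X; Y) / √(H(X) · H(Y)), which penalizes high-cardinality columns and stays between 0 and 1):

```python
import math
from collections import Counter

def entropy(values):
    counts = Counter(values)
    n = len(values)
    return sum((c / n) * math.log2(n / c) for c in counts.values())

def normalized_mi(xs, ys):
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    mi = sum((c / n) * math.log2((c * n) / (px[x] * py[y]))
             for (x, y), c in pxy.items())
    denom = math.sqrt(entropy(xs) * entropy(ys))
    return mi / denom if denom > 0 else 0.0

A = ["x", "x", "y", "x", "z", "z", "y"]
B = ["a", "b", "a", "a", "b", "b", "a"]
print(normalized_mi(A, B))  # ≈ 0.478
```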