Easier, Faster, Smarter
Friday, October 18, 2013
How to compute Column Dependencies on a Data Stream using MapReduce
Hans-Henning Gabriel

© 2013 Datameer, Inc. All rights reserved.

Relationship Between Attributes

Some Basic Theory

From Entropy To Mutual Information
A   B   C
x   a   just
x   b   some
y   a   random
x   a   text
z   b   in
z   b   this
y   a   column

Relationship Between A and B?

A == z ➔ B == b
B == b ➔ A == ?
C ➔ A?

How strongly do A, B and C determine each other?
From Entropy To Mutual Information
Entropy: how mixed up are the values?
$H(X) = \sum_{x} p(x) \log \frac{1}{p(x)}$

• H(X) ≥ 0
• maximum entropy is log |X|
• the more uniformly X is distributed, the higher the entropy

[Figure labels: H(Y) = 0.54, H(Y) = 1, H(Z) = 1.41]
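
The entropy definition can be checked on the example columns A and B. This is a minimal sketch, assuming base-2 logarithms (which reproduces H(A) = 1.56 and H(B) = 0.985 from the joint-entropy slide); the function name `entropy` is illustrative only.

```python
from collections import Counter
from math import log2

def entropy(values):
    """Shannon entropy in bits of the empirical distribution of `values`."""
    n = len(values)
    # p(x) = c / n, so log(1 / p(x)) = log(n / c)
    return sum((c / n) * log2(n / c) for c in Counter(values).values())

A = list("xxyxzzy")
B = list("abaabba")
print(round(entropy(A), 2))  # 1.56
print(round(entropy(B), 3))  # 0.985
```

A column with a single distinct value gives 0, and two equally frequent values give 1 bit.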
From Entropy To Mutual Information
A: x x y x z z y
B: a b a a b b a

Joint Entropy:
$H(X, Y) = \sum_{x} \sum_{y} p(x, y) \log \frac{1}{p(x, y)}$

Joint distribution p(A, B):

        x     y     z     p(B)
a      2/7   2/7    0     4/7
b      1/7    0    2/7    3/7
p(A)   3/7   2/7   2/7

H(A, B) = 1.95
H(A) = 1.56
H(B) = 0.985
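
A short sketch of the joint entropy for the same two columns, again assuming base-2 logarithms; `joint_entropy` is a hypothetical helper name. It treats each (A, B) pair as one symbol and applies the entropy formula to the pairs.

```python
from collections import Counter
from math import log2

def joint_entropy(xs, ys):
    """H(X, Y) in bits, estimated from paired observations."""
    n = len(xs)
    pair_counts = Counter(zip(xs, ys))
    return sum((c / n) * log2(n / c) for c in pair_counts.values())

A = list("xxyxzzy")
B = list("abaabba")
print(round(joint_entropy(A, B), 2))  # 1.95
```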
From Entropy To Mutual Information
A: x x y x z z y
B: a b a a b b a

Conditional Entropy:
how much uncertainty remains about Y when
we know the value of X?

$H(Y | X) = \sum_{x} p(x) H(Y | X = x)$

Conditional distribution p(A | B) with the entropy of each row:

      x     y     z     H(A | B = row)
a    2/4   2/4    0     1.0
b    1/3    0    2/3    0.918

• compute entropies on the conditional distributions
• compute the weighted average

$H(A | B) = \frac{4}{7} H(A | B = a) + \frac{3}{7} H(A | B = b) = 0.965$
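
The two steps (entropy per conditional distribution, then weighted average) can be sketched as follows, assuming base-2 logarithms; `conditional_entropy(ys, xs)` is an illustrative name and returns H(Y | X).

```python
from collections import Counter, defaultdict
from math import log2

def entropy(values):
    n = len(values)
    return sum((c / n) * log2(n / c) for c in Counter(values).values())

def conditional_entropy(ys, xs):
    """H(Y | X): entropy of ys inside each group of equal x, weighted by p(x)."""
    groups = defaultdict(list)
    for x, y in zip(xs, ys):
        groups[x].append(y)
    n = len(xs)
    return sum(len(g) / n * entropy(g) for g in groups.values())

A = list("xxyxzzy")
B = list("abaabba")
print(round(conditional_entropy(A, B), 3))  # H(A | B) = 0.965
```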
From Entropy To Mutual Information
A: x x y x z z y
B: a b a a b b a

Mutual Information:
reduction of uncertainty of X due to the
knowledge of Y

$I(X; Y) = H(Y) - H(Y | X) = H(X) - H(X | Y) = \sum_{x} \sum_{y} p(x, y) \log \frac{p(x, y)}{p(x) p(y)}$
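
For the example columns, the double sum can be evaluated directly. A minimal sketch, assuming base-2 logarithms; `mutual_information` is a hypothetical name. The result agrees with H(A) - H(A | B) = 1.56 - 0.965 up to rounding.

```python
from collections import Counter
from math import log2

def mutual_information(xs, ys):
    """I(X; Y) in bits from paired observations."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    # p(x, y) / (p(x) p(y)) = c * n / (count(x) * count(y))
    return sum(
        (c / n) * log2(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )

A = list("xxyxzzy")
B = list("abaabba")
print(round(mutual_information(A, B), 3))  # 0.592
```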
Further Conditions
• data arrives as a stream
• data is big
• as little user interaction as possible

Outline
Partition Incremental Discretization (PiD)
• original
• adjusted
• as MapReduce

2-D histograms on a data stream
• how to create
• handle discrete data
• mutual information

QA
Partition Incremental Discretization (PiD)
PiD - 2 layer approach
[Figure: Histogram of Values (x-axis: Values, y-axis: Frequency) with the layer-1 counts and breaks. Split: a bin with count 10 between breaks 3 and 4 becomes two bins with count 5 each and a new break at 3.5 (threshold alpha?). Border Extension: new bins are added at the borders with step=1.]
PiD - dropping parameters
splitting threshold alpha:
• split when $\frac{\mathrm{count} + 1}{\mathrm{total} + 2} > \alpha$
• what is a good value?
parameter step:
• maintain min and max values
• extend border breaks based on min and max
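
A minimal sketch of a layer-1 update, under stated assumptions: a bin is split once (count + 1) / (total + 2) exceeds alpha, the split cuts the bin at its midpoint and gives each half half of the count (as in the 10 to 5 and 5 example), and values outside the current range add bins of width `step`. The class name `PiDLayer1` and its parameters are hypothetical, and the adjusted variant that derives the extension from min and max is not shown.

```python
import bisect

class PiDLayer1:
    """Layer-1 histogram: equal-width bins that split when they get too dense."""

    def __init__(self, lo, hi, step=1.0, alpha=0.1):
        n = max(1, round((hi - lo) / step))
        self.breaks = [lo + i * step for i in range(n + 1)]
        self.counts = [0.0] * n
        self.step = step
        self.alpha = alpha
        self.total = 0

    def update(self, value):
        # Border extension: add bins of width `step` until the value fits.
        while value < self.breaks[0]:
            self.breaks.insert(0, self.breaks[0] - self.step)
            self.counts.insert(0, 0.0)
        while value >= self.breaks[-1]:
            self.breaks.append(self.breaks[-1] + self.step)
            self.counts.append(0.0)
        # Bin k covers [breaks[k], breaks[k + 1]).
        k = bisect.bisect_right(self.breaks, value) - 1
        self.counts[k] += 1
        self.total += 1
        # Assumed split rule: (count + 1) / (total + 2) > alpha.
        if (self.counts[k] + 1) / (self.total + 2) > self.alpha:
            mid = (self.breaks[k] + self.breaks[k + 1]) / 2
            half = self.counts[k] / 2
            self.breaks.insert(k + 1, mid)
            self.counts[k:k + 1] = [half, half]

hist = PiDLayer1(lo=2.0, hi=6.0, step=1.0, alpha=0.25)
for v in [2.5, 3.1, 3.2, 3.4, 3.6, 3.7, 4.5, 5.5, 7.2]:
    hist.update(v)
print(list(zip(hist.breaks, hist.counts)))  # (left edge, count) per bin
```

Each update costs one binary search plus at most one split, so the histogram can be maintained record by record on a stream.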
PiD - number of bins
split when: $\frac{\mathrm{count} + 1}{\mathrm{total} + 2} > \alpha$

[Figure: number of bins (y-axis) against number of records (x-axis, 0 to 1000), one curve each for alpha = 0.01, 0.02, 0.04, 0.08, 0.16 and 0.32]
PiD - MapReduce

[Figure: two layer-1 histograms with bins A1, A2, A3, A4 and A5, A6, A7, A8 are merged into A1, A2 + A5, A3 + A6, A4 + A7, A8]
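
One way a reduce step could combine two layer-1 histograms is sketched below. It assumes counts are spread uniformly inside a bin, so that histograms with different breaks can be re-cut on the union of all breaks; with identical breaks this reduces to adding counts bin by bin, as in A2 + A5. The function name `merge_histograms` is hypothetical.

```python
def merge_histograms(breaks_a, counts_a, breaks_b, counts_b):
    """Merge two histograms on the union of their breaks."""
    breaks = sorted(set(breaks_a) | set(breaks_b))
    counts = [0.0] * (len(breaks) - 1)
    for src_breaks, src_counts in ((breaks_a, counts_a), (breaks_b, counts_b)):
        for lo, hi, c in zip(src_breaks, src_breaks[1:], src_counts):
            for i in range(len(counts)):
                left = max(lo, breaks[i])
                right = min(hi, breaks[i + 1])
                if right > left:
                    # Share of the source bin that overlaps target bin i.
                    counts[i] += c * (right - left) / (hi - lo)
    return breaks, counts

# Bins A1..A4 on [0, 4] and A5..A8 on [1, 5], shifted by one bin.
breaks, counts = merge_histograms([0, 1, 2, 3, 4], [1, 2, 3, 4],
                                  [1, 2, 3, 4, 5], [5, 6, 7, 8])
print(breaks)  # [0, 1, 2, 3, 4, 5]
print(counts)  # [1.0, 7.0, 9.0, 11.0, 8.0]
```

Because the merge is associative up to floating-point rounding, partial histograms from several mappers can be combined in any order.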
PiD - Evaluation
Percentage Error
$\mathrm{PercentageError}(P, S) = \frac{\sum_{i=1}^{k} |P_i - S_i|}{\sum_{i=1}^{k} S_i}$

Affinity Coefficient
$\delta(P, S) = \sum_{i=1}^{k} \sqrt{P_i \cdot S_i}$
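
Both measures are easy to compute once two histograms share the same k bins. The sketch assumes that P holds the relative bin frequencies of the approximated histogram and S those of the reference histogram, and that the affinity coefficient takes its usual form, the sum of sqrt(P_i * S_i) over the bins. All function names are illustrative.

```python
from math import sqrt

def normalize(counts):
    """Turn bin counts into relative frequencies."""
    total = sum(counts)
    return [c / total for c in counts]

def percentage_error(p, s):
    """Sum of absolute bin differences, relative to the mass of S."""
    return sum(abs(pi - si) for pi, si in zip(p, s)) / sum(s)

def affinity_coefficient(p, s):
    """Sum of sqrt(P_i * S_i); equals 1 for identical distributions."""
    return sum(sqrt(pi * si) for pi, si in zip(p, s))

S = normalize([10, 20, 30, 40])   # reference histogram (made-up counts)
P = normalize([12, 18, 30, 40])   # approximation (made-up counts)
print(percentage_error(P, S), affinity_coefficient(P, S))
```

A good approximation shows an error near 0 and an affinity near 1.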
PiD - Evaluation
[Figure: Varying Distributions. Panels: Uniform Distribution, Log Normal Distribution, Normal Distribution; series: original, PiD, aPiD]

Reported values (percentage error; affinity coefficient):
• PiD = 0.0010695, aPiD = 0.0044543; PiD = 0.9999998, aPiD = 0.9999959
• PiD = 0.0934349, aPiD = 0.0369968; PiD = 0.9869035, aPiD = 0.9956227
• PiD = 0.0153203, aPiD = 0.0197731; PiD = 0.9993737, aPiD = 0.9958205
PiD - Evaluation

Varying Alpha
Two-Dimensional Histograms
Building a Quadtree
[Figure: a quadtree over the two-dimensional value space with a count per cell]

• how to choose bin width?
• how to merge?
• equal frequencies or equal width?

Distributed Merge

• start with unit-square
• extend by double; split by half
  ➔ logarithmic number of splits/extensions
• merge by aligning unit-squares

[Figure: two quadtrees with per-cell counts are merged after their unit squares are aligned]
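
The alignment idea can be illustrated with a flat grid instead of a full quadtree. This sketch assumes every partial histogram uses square cells whose width is a power of two and whose corners sit on multiples of that width, so that finer cells nest exactly inside coarser ones. It merges by coarsening to the larger cell width, which is a simplification chosen here, not necessarily what the slides do. `cell_key`, `coarsen` and `merge_grids` are hypothetical names.

```python
import math
from collections import Counter

def cell_key(x, y, size):
    """Index of the aligned square cell of width `size` that contains (x, y)."""
    return (math.floor(x / size), math.floor(y / size))

def coarsen(cells, size, target):
    """Re-bin cell counts from width `size` to the coarser width `target`."""
    factor = round(target / size)  # a power of two by assumption
    out = Counter()
    for (i, j), c in cells.items():
        out[(i // factor, j // factor)] += c
    return out

def merge_grids(cells_a, size_a, cells_b, size_b):
    """Bring both grids to the coarser cell width and add the counts."""
    size = max(size_a, size_b)
    merged = coarsen(cells_a, size_a, size) + coarsen(cells_b, size_b, size)
    return merged, size

cells_a = Counter({(0, 0): 1, (1, 0): 2, (2, 3): 1})  # cell width 1
cells_b = Counter({(0, 0): 3, (1, 1): 4})             # cell width 2
merged, size = merge_grids(cells_a, 1, cells_b, 2)
print(size, dict(merged))
```

Since widths only double or halve, two grids built independently differ by a power of two and always line up.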
Deriving the Layer 2 Histogram
[Figure: layer-2 histograms derived from the layer-1 cell counts. Equal Width: cells are combined into bins of equal width. Equal Frequency: the total count of 34 is divided evenly ➔ 4.25 per bin.]
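
A one-dimensional sketch of the equal-frequency step, assuming counts are spread uniformly inside each layer-1 bin so that cut points can be interpolated. With a total of 34 and 8 bins the target would be 4.25 per bin, as on the slide. `equal_frequency_breaks` is a hypothetical name, and the two-dimensional case is not covered.

```python
def equal_frequency_breaks(breaks, counts, k):
    """Breaks of k layer-2 bins that each hold total / k of the mass."""
    target = sum(counts) / k
    out = [breaks[0]]
    cum = 0.0  # mass of all layer-1 bins before the current one
    for lo, hi, c in zip(breaks, breaks[1:], counts):
        # Place every interior cut whose cumulative target falls in this bin.
        while len(out) < k and c > 0 and cum + c >= len(out) * target:
            frac = (len(out) * target - cum) / c
            out.append(lo + frac * (hi - lo))
        cum += c
    out.append(breaks[-1])
    return out

print(equal_frequency_breaks([0, 1, 2, 3, 4], [4, 4, 4, 4], 2))  # [0, 2.0, 4]
```

An equal-width layer 2 needs no interpolation of this kind: it only sums the layer-1 counts that fall into each fixed-width bin.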
How to deal with discrete data
PiD and Map per bin
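[Figure: numeric column A (2.3, 3.6, 4.1, 2.9, ...) is discretized with PiD, and every bin keeps a map with the counts of the discrete values of column B (e, g, e, a, ...), for example {a:3, e:1} and {e:2, g:2, h:1}. Maps follow the bins when they are split or merged, for example {a:1, b:1} becomes {a:0.5, b:0.5}.]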
Layer 2: number of bins = |vocabulary|
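
A sketch of "PiD and Map per bin", assuming a split hands half of every value count to each child bin, in the same way the numeric counts are halved. The class `BinnedCategories` and the sample records are hypothetical.

```python
import bisect
from collections import Counter

class BinnedCategories:
    """Numeric bins over column A, each holding value counts of column B."""

    def __init__(self, breaks):
        self.breaks = list(breaks)
        self.maps = [Counter() for _ in range(len(self.breaks) - 1)]

    def add(self, a_value, b_value):
        k = bisect.bisect_right(self.breaks, a_value) - 1
        k = min(max(k, 0), len(self.maps) - 1)  # clamp values on the borders
        self.maps[k][b_value] += 1

    def split(self, k):
        """Cut bin k at its midpoint; each half gets half of every count."""
        mid = (self.breaks[k] + self.breaks[k + 1]) / 2
        half = Counter({v: c / 2 for v, c in self.maps[k].items()})
        self.breaks.insert(k + 1, mid)
        self.maps[k:k + 1] = [half, Counter(half)]

h = BinnedCategories([2, 3, 4, 5])
for a, b in [(2.3, "e"), (3.6, "g"), (4.1, "e"), (2.9, "a")]:
    h.add(a, b)
h.split(0)
print(h.breaks, [dict(m) for m in h.maps])
```

The per-bin maps already form a joint count table of binned A against the values of B, which is the input the mutual information needs.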
Mutual Information

[Figure: two panels, equal width and equal frequency]
[Figure: six scatter plots, each annotated with its mutual information]
• Mutual Information: 0.396 (0.919)
• Mutual Information: 0.023 (0.026)
• Mutual Information: 0.171 (0.131)
• Mutual Information: 0.102 (0.022)
• Mutual Information: 0.013 (0.03)
• Mutual Information: 0.35 (0.544)
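
Once a two-dimensional histogram exists, mutual information follows from its cell counts alone. A minimal sketch, assuming the histogram is given as a list of rows of counts and base-2 logarithms; `mutual_information_from_counts` is a hypothetical name.

```python
from math import log2

def mutual_information_from_counts(grid):
    """I(X; Y) in bits from a 2-D table of cell counts (rows = X, columns = Y)."""
    total = sum(map(sum, grid))
    row = [sum(r) for r in grid]
    col = [sum(c) for c in zip(*grid)]
    return sum(
        (c / total) * log2(c * total / (row[i] * col[j]))
        for i, r in enumerate(grid)
        for j, c in enumerate(r)
        if c > 0  # empty cells contribute nothing
    )

# Joint counts of the example columns: rows B = a, b; columns A = x, y, z.
print(round(mutual_information_from_counts([[2, 2, 0], [1, 0, 2]]), 3))  # 0.592
```

Fractional counts, as produced by splitting bins, work unchanged because only ratios of counts enter the formula.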
Normalization


$\frac{I(X; Y)}{\sqrt{H(X) H(Y)}}$

• penalize variables with large cardinality
• scale value between 0 and 1
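
A sketch of the normalization on the example columns, assuming the denominator is the square root of H(X) H(Y); that form keeps the result between 0 and 1 because I(X; Y) never exceeds the smaller of the two entropies. All function names are illustrative.

```python
from collections import Counter
from math import log2, sqrt

def entropy(values):
    n = len(values)
    return sum((c / n) * log2(n / c) for c in Counter(values).values())

def mutual_information(xs, ys):
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * log2(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())

def normalized_mi(xs, ys):
    """I(X; Y) / sqrt(H(X) H(Y)), defined as 0 when either column is constant."""
    hx, hy = entropy(xs), entropy(ys)
    if hx == 0 or hy == 0:
        return 0.0
    return mutual_information(xs, ys) / sqrt(hx * hy)

A = list("xxyxzzy")
B = list("abaabba")
print(round(normalized_mi(A, B), 3))  # 0.478
```

A column with many distinct values has a large entropy, so its score is pulled down unless the mutual information is correspondingly large.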
@Datameer
hgabriel@datameer.com

Big Data TechCon: How to Compute Column Dependencies on a Data Stream Using MapReduce
