Exact Inference in Bayesian Networks using MapReduce (Hadoop Summit 2010)

Exact Inference in Bayesian Networks using MapReduce
Alex Kozlov
Cloudera, Inc.

Session Agenda

• About Me
• About Cloudera
• Bayesian (Probabilistic) Networks
• BN Inference 101
• CPCS Network
• Why BN Inference
• Inference with MR
• Results
• Conclusions

About Me

• Worked on BN inference in 1995-1998 (for Ph.D.)
› Published the fastest implementation at the time
• Worked in the DM/BI field since then
• Recently joined Cloudera, Inc.
› Started looking at how to solve the world’s hardest problems


About Cloudera

Founded in the summer of 2008
Cloudera helps organizations profit from all of their data. We deliver the
industry-standard platform that consolidates, stores and processes
any kind of data, from any source, at scale. We make it possible to do
more powerful analysis of more kinds of data than ever before. With
Cloudera, you get better insight into your customers, partners, vendors
and businesses.

Cloudera’s platform is built on the popular open source Apache Hadoop
project. We deliver the innovative work of a global community of
contributors in a package that makes it easy for anyone to put the
power of Google, Facebook and Yahoo! to work on their own problems.


Bayesian Networks

1. Nodes
2. Edges
3. Probabilities

Bayes, Thomas (1763). An essay towards solving a problem in the
doctrine of chances (published posthumously by his friend).
Philosophical Transactions of the Royal Society of London, 53:370-418


Applications

1. Computational biology and bioinformatics (gene regulatory networks,
protein structure, gene expression analysis)
2. Medicine
3. Document classification, information retrieval
4. Image processing
5. Data fusion
6. Gaming
7. Law
8. On-line advertising!


A Simple BN Network

Rain
T      F
0.2    0.8

Sprinkler (given Rain)
Rain   T      F
F      0.4    0.6
T      0.1    0.9

Wet Driveway (given Sprinkler, Rain)
Sprinkler, Rain   T      F
F, F              0.01   0.99
F, T              0.8    0.2
T, F              0.9    0.1
T, T              0.99   0.01

Pr(Rain | Wet Driveway)
Pr(Sprinkler Broken | !Wet Driveway & !Rain)
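
The first query above can be answered by brute-force enumeration of the joint distribution. The following is a minimal sketch, assuming the slide's tables read as Pr(Rain=T) = 0.2, Pr(Sprinkler=T | Rain) = 0.1 (rain) or 0.4 (no rain), and the Wet Driveway table as listed; the function and variable names are illustrative and do not come from the talk.

```python
from itertools import product

# CPT values as read from the slide (this reading of the layout is an assumption).
P_RAIN = {True: 0.2, False: 0.8}
P_SPRINKLER_ON = {True: 0.1, False: 0.4}  # Pr(Sprinkler=T | Rain)
P_WET = {                                  # Pr(Wet=T | Sprinkler, Rain)
    (False, False): 0.01,
    (False, True): 0.8,
    (True, False): 0.9,
    (True, True): 0.99,
}


def bernoulli(p_true, value):
    """Probability of `value` for a binary variable with Pr(True) = p_true."""
    return p_true if value else 1.0 - p_true


def joint(rain, sprinkler, wet):
    """Joint probability as the product of all (conditional) probabilities."""
    return (P_RAIN[rain]
            * bernoulli(P_SPRINKLER_ON[rain], sprinkler)
            * bernoulli(P_WET[(sprinkler, rain)], wet))


def pr_rain_given_wet():
    """Pr(Rain=T | Wet=T) = Pr(Rain=T, Wet=T) / Pr(Wet=T)."""
    numerator = sum(joint(True, s, True) for s in (True, False))
    denominator = sum(joint(r, s, True)
                      for r, s in product((True, False), repeat=2))
    return numerator / denominator


print(round(pr_rain_given_wet(), 4))
```

With only three binary nodes the joint table has 8 rows, so enumeration is trivial; the rest of the talk is about what happens when it is not.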

BN Inference 101 (in Hive)

JPD = <product of all probabilities and conditional
probabilities in the network> = Pr(A, B, …, H)
PAB = SELECT A, B, SUM(PROB) AS PROB FROM JPD GROUP BY A, B;
PB = SELECT B, SUM(PROB) AS PROB FROM PAB GROUP BY B;
Pr(A|B) = Pr(A,B)/Pr(B) (definition of conditional probability, the basis of Bayes’ rule)

CPCS has 422 nodes, so the JPD is a table of at least 2^422 rows!
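
The same GROUP BY pattern can be sketched outside Hive. The snippet below is an illustrative Python version, not the talk's code: `group_by_sum` is a hypothetical helper, and the three-variable JPD contains invented numbers chosen only so that they sum to 1.

```python
from collections import defaultdict


def group_by_sum(table, columns, keep):
    """Mimic SELECT keep, SUM(PROB) FROM table GROUP BY keep on a dict table."""
    idx = [columns.index(name) for name in keep]
    out = defaultdict(float)
    for row, prob in table.items():
        out[tuple(row[i] for i in idx)] += prob
    return dict(out)


# Hypothetical JPD over binary variables A, B, C (invented values).
COLUMNS = ("A", "B", "C")
JPD = {
    (0, 0, 0): 0.10, (0, 0, 1): 0.05, (0, 1, 0): 0.15, (0, 1, 1): 0.10,
    (1, 0, 0): 0.05, (1, 0, 1): 0.20, (1, 1, 0): 0.05, (1, 1, 1): 0.30,
}

PAB = group_by_sum(JPD, COLUMNS, ("A", "B"))   # Pr(A, B)
PB = group_by_sum(PAB, ("A", "B"), ("B",))     # Pr(B)
PR_A_GIVEN_B = {(a, b): p / PB[(b,)] for (a, b), p in PAB.items()}

# Row count of a full JPD over 422 binary nodes: far too many rows to store.
ROWS_CPCS = 2 ** 422
print(len(str(ROWS_CPCS)), "decimal digits in the row count")
```

The approach is exact but scales with the size of the full table, which is why the naive version cannot be applied to CPCS directly.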


CPCS Networks

422 nodes:
14 nodes describe diseases
33 risk factors
375 various findings related to diseases


Why Bayesian Network Inference?

Choose the right tool for the right job!

• A BN is an abstraction for reasoning and decision making
• Easy to incorporate human insight and intuitions
• Very general, no specific ‘label’ node
• Easy to do ‘what-if’, strength-of-influence, and value-of-information
analysis
• Does not depend on Gaussian assumptions

It’s all just a joint probability distribution


Map & Reduces
[Figure: map keys, reduce keys, and the two aggregation steps for clique BCE]

Map keys: A1B1, A2B1, A1B2, A2B2 (grouped under B1, B2) and
C1D1, C2D1, C1D2, C2D2 (grouped under C1, C2)
Reduce keys: B1C1E1, B1C1E2, B1C2E1, B1C2E2, B2C1E1, B2C1E2, B2C2E1, B2C2E2
Aggregation 1 (+): ∑ Pr(B1| A) and ∑ Pr(D| C1)
Aggregation 2 (x): Pr(C| BE) x ∑ Pr(B1| A) x ∑ Pr(D| C1), giving the values of clique BCE

MapReduce Implementation

for each clique in depth-first order:
  MAP (runs over all child cliques):
    Sum over the variables to get the ‘clique message’ (requires state,
    a custom partitioner, and a custom input format)
    Emit factors for the next clique

  REDUCE:
    Multiply the factors from all children
    Include the probabilities assigned to the clique
    Form the new clique values
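
To make one step of this loop concrete, here is a small in-memory sketch in Python. It is not the Hadoop implementation from the talk: there is no partitioner or input format, the "shuffle" is a dictionary, all CPT values are invented, and the clique layout (children AB and CD, parent BCE, with D observed as evidence) is an assumption modeled loosely on the preceding diagram.

```python
from collections import defaultdict
from itertools import product
from math import prod

DOMAIN = (0, 1)  # every variable is binary in this toy example

# Hypothetical CPTs (invented values, for illustration only).
P_A = {0: 0.3, 1: 0.7}
P_E = {0: 0.6, 1: 0.4}
P_B_GIVEN_A = {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.8}  # key (a, b)
P_D_GIVEN_C = {(0, 0): 0.7, (0, 1): 0.3, (1, 0): 0.1, (1, 1): 0.9}  # key (c, d)
P_C1_GIVEN_BE = {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.6, (1, 1): 0.95}  # (b, e)

PARENT_KEYS = list(product(DOMAIN, repeat=3))  # entries (b, c, e) of clique BCE


def p_c_given_be(c, b, e):
    p1 = P_C1_GIVEN_BE[(b, e)]
    return p1 if c == 1 else 1.0 - p1


def map_child(table, child_pos, parent_pos):
    """MAP: sum out everything but the shared variable, then emit the
    resulting message to every parent entry that agrees on that variable."""
    message = defaultdict(float)
    for assignment, p in table.items():
        message[assignment[child_pos]] += p          # aggregation 1 (+)
    for key in PARENT_KEYS:
        yield key, message[key[parent_pos]]


def reduce_clique(key, factors):
    """REDUCE: multiply child factors with the clique's own probabilities."""
    b, c, e = key
    local = p_c_given_be(c, b, e) * P_E[e]
    return local * prod(factors)                      # aggregation 2 (x)


# Child clique tables; D = 1 is treated as observed evidence.
clique_ab = {(a, b): P_A[a] * P_B_GIVEN_A[(a, b)]
             for a, b in product(DOMAIN, repeat=2)}
clique_cd = {(c, 1): P_D_GIVEN_C[(c, 1)] for c in DOMAIN}

# Stand-in for the shuffle phase: group emitted factors by parent key.
shuffled = defaultdict(list)
for key, factor in map_child(clique_ab, child_pos=1, parent_pos=0):
    shuffled[key].append(factor)
for key, factor in map_child(clique_cd, child_pos=0, parent_pos=1):
    shuffled[key].append(factor)

clique_bce = {key: reduce_clique(key, f) for key, f in shuffled.items()}
```

Under these assumptions `clique_bce[(b, c, e)]` holds Pr(B=b, C=c, E=e, D=1); repeating the same step up the tree would carry the evidence to the root clique.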


Cliques, Trees, and Parallelism

[Figure: clique tree with cliques C1 through C6]

o Topological parallelism: compute branches C2 and C4 in parallel
o Clique parallelism: divide computation of each clique into
maps/reducers
o Fall back into optimal factoring if a corresponding subtree is small
o Combine multiple phases together
o Reduce replication level

Cliques may be larger than they appear!
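
Topological parallelism can be illustrated by grouping cliques into "waves" whose members have no pending children and can therefore run at the same time. This is only a sketch: `schedule_waves` is a hypothetical helper, and the tree shape below is an assumption, since the slide's figure does not survive in the text.

```python
def schedule_waves(children):
    """Group cliques into waves; a clique is ready once all of its
    children have been processed. `children` maps clique -> child list
    and must list every clique (leaves map to an empty list)."""
    remaining = set(children)
    done = set()
    waves = []
    while remaining:
        ready = sorted(c for c in remaining
                       if all(child in done for child in children[c]))
        if not ready:
            raise ValueError("clique structure is not a tree")
        waves.append(ready)
        done.update(ready)
        remaining.difference_update(ready)
    return waves


# Assumed layout: C6 is the root, and C2 and C4 head independent branches.
TREE = {
    "C6": ["C5"],
    "C5": ["C2", "C4"],
    "C2": ["C1"],
    "C4": ["C3"],
    "C1": [],
    "C3": [],
}

for number, wave in enumerate(schedule_waves(TREE), start=1):
    print(f"wave {number}: {', '.join(wave)}")
```

Each wave could be submitted as concurrent jobs; clique parallelism then splits the work inside a single clique across mappers and reducers.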

CPCS Inference

CPCS:
The 360-node subnet has a largest ‘clique’ of 11,739,896 floats
(fits into 2 GB)
The full 422-node version (states: absent, mild, moderate, severe):
3,377,699,720,527,872 floats (about 12 PB of storage, although not all
of it is needed for every query)

In most cases there is no need to do inference on the full network
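
The storage figures can be checked with a few lines of arithmetic. The sketch below assumes 4-byte (single-precision) floats and binary units (1 PiB = 2^50 bytes); neither assumption is stated on the slide.

```python
BYTES_PER_FLOAT = 4  # assumption: single-precision floats


def storage_bytes(num_floats, bytes_per_float=BYTES_PER_FLOAT):
    """Raw storage needed for a clique table of `num_floats` entries."""
    return num_floats * bytes_per_float


CPCS360_FLOATS = 11_739_896
CPCS422_FLOATS = 3_377_699_720_527_872

cpcs360_mib = storage_bytes(CPCS360_FLOATS) / 2**20
cpcs422_pib = storage_bytes(CPCS422_FLOATS) / 2**50

print(f"cpcs360 largest clique: {cpcs360_mib:.1f} MiB")
print(f"cpcs422: {cpcs422_pib:.1f} PiB")
```

Under these assumptions the 422-node figure comes out at exactly 12 PiB, consistent with the 12 PB quoted above, and the 360-node clique fits comfortably in memory.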


Results

Network      Memory    Time (1997) [1]   Macbook Pro (2010) [2]   Hadoop (& future) [3]
Random (B)   10 MB     33 sec            < 1 sec
Random (A)   254 MB    260 sec           10 sec
cpcs360      2 GB      640 sec           15 sec                   1 min
cpcs422      > 12 PB   N/A               N/A                      Minutes to hours for most of the queries on most of the clusters

[1] ‘used an SGI Origin 2000 machine with sixteen MIPS R10000 processors (195
MHz clock speed)’ in 1997
[2] Macbook Pro, 4 GB DDR3, 2.53 GHz
[3] 10-node Linux Xeon cluster, 24 GB, quad 2-core


Conclusions

• Exact probabilistic inference is finally in sight for the full 422-node
CPCS network
• Hadoop helps to solve the world’s hardest problems

What you should know after this talk

A BN is a DAG that represents a joint probability distribution (JPD)
Conditional probabilities can be computed by multiplying and summing over the JPD
For large networks, this may mean petabytes of intermediate data, but that is a workload MapReduce can handle


Questions?

alexvk@{cloudera,gmail}.com
