Exact Inference in Bayesian Networks using MapReduce
Alex Kozlov
Cloudera, Inc.
Session Agenda

• About Me
• About Cloudera
• Bayesian (Probabilistic) Networks
• BN Inference 101
• CPCS Network
• Why BN Inference
• Inference with MR
• Results
• Conclusions
About Me

• Worked on BN inference in 1995-1998 (for Ph.D.)
  › Published the fastest implementation at the time
• Worked in the DM/BI field since then
• Recently joined Cloudera, Inc.
  › Started looking at how to solve the world’s hardest problems
About Cloudera


Founded in the summer of 2008

Cloudera helps organizations profit from all of their data. We deliver the
  industry-standard platform which consolidates, stores and processes
  any kind of data, from any source, at scale. We make it possible to do
  more powerful analysis of more kinds of data, at scale, than ever
  before. With Cloudera, you get better insight into your customers,
  partners, vendors and businesses.

Cloudera’s platform is built on the popular open source Apache Hadoop
  project. We deliver the innovative work of a global community of
  contributors in a package that makes it easy for anyone to put the
  power of Google, Facebook and Yahoo! to work on their own problems.
Bayesian Networks

1. Nodes
2. Edges
3. Probabilities

Bayes, Thomas (1763). An essay towards solving a problem in the doctrine of chances (published posthumously by his friend). Philosophical Transactions of the Royal Society of London, 53:370-418.
Applications


1. Computational biology and bioinformatics (gene regulatory networks,
   protein structure, gene expression analysis)
2. Medicine
3. Document classification, information retrieval
4. Image processing
5. Data fusion
6. Gaming
7. Law
8. On-line advertising!



A Simple Bayesian Network

Pr(Rain):
  Rain = T    Rain = F
  0.2         0.8

Pr(Sprinkler | Rain):
  Rain    Sprinkler = T    Sprinkler = F
  F       0.4              0.6
  T       0.1              0.9

Pr(Wet Driveway | Sprinkler, Rain):
  Sprinkler, Rain    Wet = T    Wet = F
  F, F               0.01       0.99
  F, T               0.8        0.2
  T, F               0.9        0.1
  T, T               0.99       0.01

Queries:
  Pr(Rain | Wet Driveway)
  Pr(Sprinkler Broken | !Wet Driveway & !Rain)
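
The first query can be answered by brute force on a network this small. The sketch below is one minimal way to do it, assuming the slide's three tables are Pr(Rain), Pr(Sprinkler | Rain) and Pr(Wet Driveway | Sprinkler, Rain); the function and variable names are illustrative and not from the talk.

```python
from itertools import product

# Tables as read from the slide (True = T, False = F).
P_RAIN = {True: 0.2, False: 0.8}
# Pr(Sprinkler = T | Rain), keyed by the Rain value.
P_SPRINKLER_ON = {False: 0.4, True: 0.1}
# Pr(Wet Driveway = T | Sprinkler, Rain), keyed by (sprinkler, rain).
P_WET = {(False, False): 0.01, (False, True): 0.8,
         (True, False): 0.9, (True, True): 0.99}


def joint(rain, sprinkler, wet):
    """Pr(Rain, Sprinkler, Wet Driveway) as a product of the three tables."""
    p_s = P_SPRINKLER_ON[rain] if sprinkler else 1.0 - P_SPRINKLER_ON[rain]
    p_w = P_WET[(sprinkler, rain)] if wet else 1.0 - P_WET[(sprinkler, rain)]
    return P_RAIN[rain] * p_s * p_w


def rain_given_wet():
    """Pr(Rain = T | Wet Driveway = T): sum the joint, then normalize."""
    numerator = sum(joint(True, s, True) for s in (True, False))
    denominator = sum(joint(r, s, True)
                      for r, s in product((True, False), repeat=2))
    return numerator / denominator


posterior = rain_given_wet()
print(f"Pr(Rain | Wet Driveway) = {posterior:.4f}")
```

With these numbers the numerator is 0.1638 and the denominator 0.4566, so the posterior is roughly 0.36. Enumeration like this touches every row of the joint table, which is why it only works for toy networks.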
Asia Network

[Figure: the Asia network, with one probability table per node]

Pr(Visit to Asia)
Pr(Smoking)
Pr(Tuberculosis | Visit to Asia)
Pr(Lung Cancer | Smoking)
Pr(Bronchitis | Smoking)
Pr(C | BE)
Pr(X-Ray | Lung Cancer or Tuberculosis)
Pr(Dyspnea | CG)

Query: Pr(Lung Cancer | Neg X-Ray & Positive Dyspnea)
BN Inference 101 (in Hive)


JPD = <product of all probabilities and conditional probabilities in the network> = Pr(A, B, …, H)
PAB = SELECT A, B, SUM(PROB) AS PROB FROM JPD GROUP BY A, B;
PB = SELECT B, SUM(PROB) AS PROB FROM PAB GROUP BY B;
Pr(A|B) = Pr(A,B)/Pr(B) – Bayes’ rule
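
The queries above can be tried end to end on a toy table. This is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for Hive (an assumption made only so the example runs locally) and a hypothetical three-node chain A -> B -> C with made-up probabilities.

```python
import itertools
import sqlite3

# Hypothetical chain A -> B -> C; all numbers are made up for illustration.
p_a = {1: 0.3, 0: 0.7}
p_b_given_a = {(1, 1): 0.9, (0, 1): 0.1, (1, 0): 0.2, (0, 0): 0.8}  # key: (b, a)
p_c_given_b = {(1, 1): 0.6, (0, 1): 0.4, (1, 0): 0.5, (0, 0): 0.5}  # key: (c, b)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE JPD (A INTEGER, B INTEGER, C INTEGER, PROB REAL)")

# JPD = product of all probabilities and conditional probabilities.
for a, b, c in itertools.product((0, 1), repeat=3):
    prob = p_a[a] * p_b_given_a[(b, a)] * p_c_given_b[(c, b)]
    conn.execute("INSERT INTO JPD VALUES (?, ?, ?, ?)", (a, b, c, prob))

# Marginalize step by step with GROUP BY, as on the slide.
conn.execute(
    "CREATE TABLE PAB AS SELECT A, B, SUM(PROB) AS PROB FROM JPD GROUP BY A, B")
conn.execute(
    "CREATE TABLE PB AS SELECT B, SUM(PROB) AS PROB FROM PAB GROUP BY B")

# Pr(A|B) = Pr(A,B) / Pr(B)
rows = conn.execute(
    "SELECT PAB.A, PAB.B, PAB.PROB / PB.PROB "
    "FROM PAB JOIN PB ON PAB.B = PB.B ORDER BY PAB.B, PAB.A"
).fetchall()
pr_a_given_b = {(a, b): p for a, b, p in rows}

for (a, b), p in sorted(pr_a_given_b.items()):
    print(f"Pr(A={a} | B={b}) = {p:.4f}")
```

The JPD table here has 2^3 = 8 rows; the same approach needs one row per joint configuration, which is what makes it impractical for larger networks.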


CPCS has 422 nodes, so its JPD is a table of at least 2^422 rows!


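Junction Tree

[Figure: junction tree for the Asia network, with the probability tables assigned to its cliques]

Pr(Visit to Asia)
Pr(Tuberculosis | Visit to Asia)
Pr(F)
Pr(E | F)
Pr(G | F)
Pr(C | BE)
Pr(H | CG)
Pr(D | C)

Query: Pr(Lung Cancer | Dyspnea) = Pr(E | H)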
CPCS Networks

• 422 nodes
• 14 nodes describe diseases
• 33 risk factors
• 375 various findings related to diseases
Why Bayesian Network Inference?

Choose the right tool for the right job!

• BN is an abstraction for reasoning and decision making
• Easy to incorporate human insight and intuitions
• Very general, no specific ‘label’ node
• Easy to do ‘what-if’, strength-of-influence, and value-of-information analysis
• Immune to Gaussian assumptions

It’s all just a joint probability distribution
Map & Reduces

[Figure: one map/reduce step that forms the clique BCE from its child cliques]

Map, Aggregation 1 (+):
  A1B1, A2B1 -> B1
  A1B2, A2B2 -> B2
  C1D1, C2D1, C1D2, C2D2 -> C1, C2

Keys (emitted once per child message):
  B1C1E1, B1C1E2, B1C2E1, B1C2E2, B2C1E1, B2C1E2, B2C2E1, B2C2E2

Reduce, Aggregation 2 (x):
  ∑ Pr(B1 | A) x ∑ Pr(D | C1)
  Pr(C | BE) x ∑ Pr(B1 | A) x ∑ Pr(D | C1)  ->  BCE
MapReduce Implementation

for each clique in depth-first order:
   MAP:
       • Sum over the variables to get ‘clique message’ (requires state, custom partitioner and input format)
       • Emit factors for the next clique

   REDUCE:
       • Multiply the factors from all children
       • Include probabilities assigned to the clique
       • Form the new clique values

The MAP is done over all child cliques
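
One step of this loop can be simulated in plain Python, without Hadoop. The sketch below is an illustration under stated assumptions and not the talk's implementation. The child cliques AB and CD send messages to the clique BCE, all probability values are made up, and the names `map_child` and `reduce_clique` are hypothetical.

```python
from collections import defaultdict
from itertools import product

STATES = (1, 2)
CLIQUE_VARS = ("B", "C", "E")  # the clique being formed


def map_child(child_vars, table, shared_var):
    """MAP: sum out everything except shared_var (Aggregation 1, +),
    then emit one factor per configuration of the parent clique."""
    idx = child_vars.index(shared_var)
    message = defaultdict(float)
    for assignment, value in table.items():
        message[assignment[idx]] += value
    pos = CLIQUE_VARS.index(shared_var)
    for key in product(STATES, repeat=len(CLIQUE_VARS)):
        yield key, message[key[pos]]


def reduce_clique(key, factors, clique_table):
    """REDUCE: multiply the children's factors (Aggregation 2, x)
    with the probability assigned to the clique."""
    value = clique_table[key]
    for factor in factors:
        value *= factor
    return value


# Hypothetical tables (made-up numbers).
ab = {(1, 1): 0.27, (1, 2): 0.03, (2, 1): 0.14, (2, 2): 0.56}  # Pr(A)Pr(B|A), key (a, b)
cd = {(1, 1): 0.7, (1, 2): 0.3, (2, 1): 0.4, (2, 2): 0.6}      # Pr(D|C), key (c, d)
bce = {(1, 1, 1): 0.9, (1, 2, 1): 0.1, (1, 1, 2): 0.6, (1, 2, 2): 0.4,
       (2, 1, 1): 0.5, (2, 2, 1): 0.5, (2, 1, 2): 0.2, (2, 2, 2): 0.8}  # Pr(C|B,E), key (b, c, e)

# Shuffle: group the emitted factors by key, as the framework would.
shuffled = defaultdict(list)
emitted = list(map_child(("A", "B"), ab, "B")) + list(map_child(("C", "D"), cd, "C"))
for key, factor in emitted:
    shuffled[key].append(factor)

new_clique = {key: reduce_clique(key, factors, bce)
              for key, factors in shuffled.items()}
print(new_clique[(1, 1, 1)])
```

Each key such as B1C1E1 receives one factor per child, which is why the same key list appears twice on the previous slide. In a real job the summation state, partitioning and input format would need the custom handling the slide mentions.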

Cliques, Trees, and Parallelism

[Figure: junction tree with cliques C1 through C6]

o Topological parallelism: compute branches C2 and C4 in parallel
o Clique parallelism: divide computation of each clique into maps/reducers
o Fall back into optimal factoring if a corresponding subtree is small
o Combine multiple phases together
o Reduce replication level

Cliques may be larger than they appear!
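
Topological parallelism can be sketched as a bottom-up schedule in which cliques at the same depth are processed together. The tree shape below is hypothetical, because the slide's exact topology is not recoverable from the transcript. The thread pool and the per-clique `process` callback are also assumptions for illustration and stand in for separate MapReduce jobs.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical junction tree, child -> parent, chosen so that C2 and C4
# sit on different branches below the root C6.
PARENT = {"C1": "C2", "C2": "C3", "C3": "C6", "C4": "C5", "C5": "C6"}


def depth(clique):
    """Number of edges from the clique up to the root."""
    d = 0
    while clique in PARENT:
        clique = PARENT[clique]
        d += 1
    return d


def schedule():
    """Group cliques into waves, deepest first; a wave can run in parallel."""
    levels = {}
    for clique in set(PARENT) | set(PARENT.values()):
        levels.setdefault(depth(clique), []).append(clique)
    return [sorted(levels[d]) for d in sorted(levels, reverse=True)]


def run(process):
    """Call process(clique, child_results) wave by wave, children first."""
    children = {}
    for child, parent in PARENT.items():
        children.setdefault(parent, []).append(child)
    results = {}
    with ThreadPoolExecutor() as pool:
        for wave in schedule():
            outputs = pool.map(
                lambda c: process(c, [results[k] for k in children.get(c, [])]),
                wave)
            # Materialize the whole wave before publishing its results.
            results.update(zip(wave, list(outputs)))
    return results


# Stand-in workload: count the cliques in each subtree.
subtree_sizes = run(lambda clique, child_results: 1 + sum(child_results))
print(schedule())
print(subtree_sizes["C6"])
```

Scheduling by depth is a simplification: a leaf such as C4 waits for its wave even though it has no inputs, so a dependency-driven scheduler could start it earlier.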
CPCS Inference


CPCS:
• The 360-node subnet has a largest ‘clique’ of 11,739,896 floats (fits into 2 GB)
• The full 422-node version (absent, mild, moderate, severe): 3,377,699,720,527,872 floats (or 12 PB of storage, but not all queries need it)

In most cases there is no need to do inference on the full network
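
The storage figures follow from simple arithmetic. The sketch below assumes 4-byte single-precision floats and binary units (PiB, MiB); neither is stated on the slide.

```python
FLOAT_BYTES = 4  # assumption: single-precision floats


def table_bytes(num_floats, float_bytes=FLOAT_BYTES):
    """Bytes needed to hold a probability table of num_floats entries."""
    return num_floats * float_bytes


cpcs360_floats = 11_739_896
cpcs422_floats = 3_377_699_720_527_872

cpcs360_mib = table_bytes(cpcs360_floats) / 2**20
cpcs422_pib = table_bytes(cpcs422_floats) / 2**50

print(f"cpcs360 largest clique: {cpcs360_mib:.1f} MiB")
print(f"cpcs422: {cpcs422_pib:.1f} PiB")
```

Under these assumptions the 422-node figure is exactly 3 x 2^50 floats, which is 12 PiB, matching the slide. The largest cpcs360 clique is a few tens of MiB, so it sits well within the 2 GB mentioned.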



Results

Network      Memory     Time (1997) [1]   Macbook Pro (2010) [2]   Hadoop (& future) [3]
Random (B)   10 MB      33 sec            < 1 sec
Random (A)   254 MB     260 sec           10 sec
cpcs360      2 GB       640 sec           15 sec                   1 min
cpcs422      > 12 PB    N/A               N/A                      Minutes to hours for most of the queries on most of the clusters

[1] ‘used an SGI Origin 2000 machine with sixteen MIPS R10000 processors (195 MHz clock speed)’ in 1997
[2] Macbook Pro, 4 GB DDR3, 2.53 GHz
[3] 10-node Linux Xeon cluster, 24 GB, quad 2-core

Conclusions

• Exact probabilistic inference is finally in sight for the full 422-node CPCS network
• Hadoop helps to solve the world’s hardest problems

What you should know after this talk

• A BN is a DAG and represents a joint probability distribution (JPD)
• Conditional probabilities can be computed by multiplying and summing over the JPD
• For large networks, this may be PBytes of intermediate data, but it’s MR
Questions?

alexvk@{cloudera,gmail}.com

Exact Inference in Bayesian Networks using MapReduce (Hadoop Summit 2010)

  • 1. Exact Inference in Bayesian Networks using MapReduce Alex Kozlov Cloudera, Inc.
  • 2. Session Agenda: About Me; About Cloudera; Bayesian (Probabilistic) Networks; BN Inference 101; CPCS Network; Why BN Inference; Inference with MR; Results; Conclusions
  • 3. About Me: Worked on BN inference in 1995-1998 (for Ph.D.) and published the fastest implementation at the time. Worked in the DM/BI field since then. Recently joined Cloudera, Inc. and started looking at how to solve the world’s hardest problems.
  • 4. About Cloudera: Founded in the summer of 2008. Cloudera helps organizations profit from all of their data. We deliver the industry-standard platform which consolidates, stores and processes any kind of data, from any source, at scale. We make it possible to do more powerful analysis of more kinds of data, at scale, than ever before. With Cloudera, you get better insight into your customers, partners, vendors and businesses. Cloudera’s platform is built on the popular open source Apache Hadoop project. We deliver the innovative work of a global community of contributors in a package that makes it easy for anyone to put the power of Google, Facebook and Yahoo! to work on their own problems.
  • 5. Bayesian Networks: 1. Nodes 2. Edges 3. Probabilities. Reference: Bayes, Thomas (1763). An essay towards solving a problem in the doctrine of chances (published posthumously by his friend). Philosophical Transactions of the Royal Society of London, 53:370-418.
  • 6. Applications: 1. Computational biology and bioinformatics (gene regulatory networks, protein structure, gene expression analysis) 2. Medicine 3. Document classification, information retrieval 4. Image processing 5. Data fusion 6. Gaming 7. Law 8. On-line advertising!
  • 7. A Simple BN Network
       Pr(Rain): T 0.2, F 0.8
       Pr(Sprinkler | Rain), columns T, F: Rain = F: 0.4, 0.6; Rain = T: 0.1, 0.9
       Pr(Wet Driveway | Sprinkler, Rain), columns T, F: F, F: 0.01, 0.99; F, T: 0.8, 0.2; T, F: 0.9, 0.1; T, T: 0.99, 0.01
       Example queries: Pr(Rain | Wet Driveway); Pr(Sprinkler Broken | !Wet Driveway & !Rain)
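The first query on this slide, Pr(Rain | Wet Driveway), can be checked by brute-force enumeration. The sketch below is illustrative only: it assumes the slide's tables read as Pr(Rain = T) = 0.2, Pr(Sprinkler = T | Rain = F) = 0.4, Pr(Sprinkler = T | Rain = T) = 0.1, and the four Wet Driveway rows keyed by (Sprinkler, Rain). The names `P_RAIN`, `P_SPRINKLER_ON`, `P_WET`, `joint` and `pr_rain_given_wet` are hypothetical, not from the talk.

```python
from itertools import product

# Assumed reading of the slide's tables (T/F columns = probability of True/False).
P_RAIN = {True: 0.2, False: 0.8}
P_SPRINKLER_ON = {True: 0.1, False: 0.4}  # Pr(Sprinkler = T | Rain)
P_WET = {                                 # Pr(Wet = T | Sprinkler, Rain)
    (False, False): 0.01,
    (False, True): 0.8,
    (True, False): 0.9,
    (True, True): 0.99,
}


def joint(rain, sprinkler, wet):
    """Joint probability as the product of the three network tables."""
    p_s = P_SPRINKLER_ON[rain] if sprinkler else 1.0 - P_SPRINKLER_ON[rain]
    p_w = P_WET[(sprinkler, rain)] if wet else 1.0 - P_WET[(sprinkler, rain)]
    return P_RAIN[rain] * p_s * p_w


def pr_rain_given_wet():
    """Pr(Rain = T | Wet = T): sum out Sprinkler, then normalize."""
    numerator = sum(joint(True, s, True) for s in (True, False))
    denominator = sum(
        joint(r, s, True) for r, s in product((True, False), repeat=2)
    )
    return numerator / denominator


print(round(pr_rain_given_wet(), 4))
```

Under these assumed numbers the posterior comes out at roughly 0.36, up from the prior of 0.2.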
  • 8. Asia Network
       Factors: Pr(Visit to Asia); Pr(Smoking); Pr(Tuberculosis | Visit to Asia); Pr(Lung Cancer | Smoking); Pr(Bronchitis | Smoking); Pr(C | BE); Pr(X-Ray | Lung Cancer or Tuberculosis); Pr(Dyspnea | CG)
       Example query: Pr(Lung Cancer | Neg X-Ray & Positive Dyspnea)
  • 9. BN Inference 101 (in Hive)
       JPD = <product of all probabilities and conditional probabilities in the network> = Pr(A, B, …, H)
       PAB = SELECT A, B, SUM(PROB) AS PROB FROM JPD GROUP BY A, B;
       PB = SELECT B, SUM(PROB) AS PROB FROM PAB GROUP BY B;
       Pr(A|B) = Pr(A,B)/Pr(B) (Bayes’ rule)
       CPCS has 422 nodes, so the JPD is a table of at least 2^422 rows!
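The same group-by-and-divide recipe can be sketched in plain Python. This is a minimal illustration, not the talk's code: the three-variable `JPD` and its values are made up, and `group_by_sum` is a hypothetical helper standing in for the Hive `SUM ... GROUP BY`.

```python
from collections import defaultdict


def group_by_sum(rows, keys):
    """Equivalent of SELECT keys, SUM(PROB) ... GROUP BY keys on row dicts."""
    totals = defaultdict(float)
    for row in rows:
        totals[tuple(row[k] for k in keys)] += row["PROB"]
    return dict(totals)


# Hypothetical joint distribution over binary A, B, C (values sum to 1).
JPD = [
    {"A": a, "B": b, "C": c, "PROB": p}
    for (a, b, c), p in {
        (0, 0, 0): 0.10, (0, 0, 1): 0.05, (0, 1, 0): 0.20, (0, 1, 1): 0.15,
        (1, 0, 0): 0.05, (1, 0, 1): 0.10, (1, 1, 0): 0.25, (1, 1, 1): 0.10,
    }.items()
]

PAB = group_by_sum(JPD, ["A", "B"])  # Pr(A, B)
PB = group_by_sum(JPD, ["B"])        # Pr(B)
pr_a_given_b = {(a, b): p / PB[(b,)] for (a, b), p in PAB.items()}

# Row count of the full JPD if all 422 CPCS nodes were binary (a lower bound).
jpd_rows_lower_bound = 2 ** 422
print(pr_a_given_b[(0, 1)], len(str(jpd_rows_lower_bound)))
```

The last line also shows why the naive approach does not scale: 2^422 is a 128-digit number, so the joint table can never be materialized.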
  • 10. Junction Tree
       Factors: Pr(Visit to Asia); Pr(Tuberculosis | Visit to Asia); Pr(F); Pr(E | F); Pr(G | F); Pr(C | BE); Pr(D | C); Pr(H | CG)
       Example query: Pr(Lung Cancer | Dyspnea) = Pr(E | H)
  • 11. CPCS Networks: 422 nodes, of which 14 nodes describe diseases, 33 are risk factors, and 375 are various findings related to diseases.
  • 13. Why Bayesian Network Inference? Choose the right tool for the right job! BN is an abstraction for reasoning and decision making. Easy to incorporate human insight and intuitions. Very general, no specific ‘label’ node. Easy to do ‘what-if’, strength-of-influence, and value-of-information analysis. Immune to Gaussian assumptions. It’s all just a joint probability distribution.
  • 14. Map & Reduces [Figure: map keys A1B1 … A2B2 and C1D1 … C2D2 are grouped under reduce keys B1, B2 and C1, C2. Aggregation 1 (+) forms the sums ∑ Pr(B1| A) and ∑ Pr(D| C1). Aggregation 2 (x) forms Pr(C| BE) x ∑ Pr(B1| A) x ∑ Pr(D| C1) over the BCE combinations B1C1E1 … B2C2E2.]
  • 15. MapReduce Implementation
       for each clique in depth-first order:
         MAP:
           Sum over the variables to get the ‘clique message’ (requires state, custom partitioner and input format)
           Emit factors for the next clique
         REDUCE:
           Multiply the factors from all children
           Include probabilities assigned to the clique
           Form the new clique values
       The MAP is done over all child cliques.
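The MAP and REDUCE steps above can be sketched in plain Python for a single parent clique. This is only an in-memory illustration under stated assumptions, not the Hadoop implementation from the talk: it has no partitioner or input format, the function names `clique_message` and `absorb_messages` are hypothetical, and the tables for child cliques AB and CD and parent clique BCE hold made-up numbers.

```python
from collections import defaultdict


def clique_message(variables, table, keep):
    """MAP step: sum a child clique's table over every variable not in `keep`."""
    positions = [variables.index(v) for v in keep]
    message = defaultdict(float)
    for assignment, value in table.items():
        message[tuple(assignment[i] for i in positions)] += value
    return tuple(keep), dict(message)


def absorb_messages(variables, table, messages):
    """REDUCE step: multiply the clique's own table by each child message."""
    new_table = {}
    for assignment, value in table.items():
        for msg_vars, msg_table in messages:
            key = tuple(assignment[variables.index(v)] for v in msg_vars)
            value *= msg_table[key]
        new_table[assignment] = value
    return new_table


# Hypothetical child cliques: AB holds Pr(A, B), CD holds Pr(D | C).
ab = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}
cd = {(0, 0): 0.5, (0, 1): 0.5, (1, 0): 0.25, (1, 1): 0.75}
# Hypothetical parent clique BCE holding Pr(C | B, E) * Pr(E), uniform here.
bce = {(b, c, e): 0.25 for b in (0, 1) for c in (0, 1) for e in (0, 1)}

messages = [
    clique_message(("A", "B"), ab, ["B"]),  # sum over A
    clique_message(("C", "D"), cd, ["C"]),  # sum over D
]
new_bce = absorb_messages(("B", "C", "E"), bce, messages)
print(new_bce[(1, 0, 0)])
```

In a real job the summation would be spread across many mappers keyed by the separator assignment, which is where the state and the custom partitioner mentioned on the slide would come in.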
  • 16. Cliques, Trees, and Parallelism [Figure: junction tree of cliques C1 to C6]
       o Topological parallelism: compute branches C2 and C4 in parallel
       o Clique parallelism: divide computation of each clique into maps/reducers
       o Fall back to optimal factoring if a corresponding subtree is small
       o Combine multiple phases together
       o Reduce replication level
       Cliques may be larger than they appear!
  • 17. CPCS Inference
       The 360-node subnet has the largest ‘clique’ of 11,739,896 floats (fits into 2 GB).
       The full 422-node version (absent, mild, moderate, severe) has 3,377,699,720,527,872 floats (or 12 PB of storage, though not all queries need it).
       In most cases there is no need to do inference on the full network.
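The storage figures follow from simple arithmetic. The sketch below assumes 4-byte single-precision floats and binary units (1 PB read as 2^50 bytes); neither assumption is stated on the slide, and `table_bytes` is a hypothetical helper.

```python
BYTES_PER_FLOAT = 4  # assumption: single-precision floats


def table_bytes(num_floats, bytes_per_float=BYTES_PER_FLOAT):
    """Raw size in bytes of a probability table with `num_floats` entries."""
    return num_floats * bytes_per_float


cpcs360_floats = 11_739_896
cpcs422_floats = 3_377_699_720_527_872  # equals 3 * 2**50

cpcs360_mib = table_bytes(cpcs360_floats) / 2**20
cpcs422_pib = table_bytes(cpcs422_floats) / 2**50

print(f"cpcs360 largest clique: {cpcs360_mib:.1f} MiB")
print(f"cpcs422 largest clique: {cpcs422_pib:.0f} PiB")
```

Under these assumptions the 422-node figure is exactly 12 × 2^50 bytes, which matches the slide's 12 PB, while the largest cpcs360 clique on its own is only tens of megabytes.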
  • 18. Results
       Random (B): memory 10 MB; time in 1997 [1]: 33 sec; Macbook Pro in 2010 [2]: < 1 sec
       Random (A): memory 254 MB; time in 1997 [1]: 260 sec; Macbook Pro in 2010 [2]: 10 sec
       cpcs360: memory 2 GB; time in 1997 [1]: 640 sec; Macbook Pro in 2010 [2]: 15 sec; Hadoop (& future) [3]: 1 min
       cpcs422: memory > 12 PB; time in 1997 [1]: N/A; Macbook Pro in 2010 [2]: N/A; Hadoop (& future) [3]: minutes to hours for most of the queries on most of the clusters
       [1] ‘used an SGI Origin 2000 machine with sixteen MIPS R10000 processors (195 MHz clock speed)’ in 1997
       [2] Macbook Pro, 4 GB DDR3, 2.53 GHz
       [3] 10-node Linux Xeon cluster, 24 GB, quad 2-core
  • 19. Conclusions
       Exact probabilistic inference is finally in sight for the full 422-node CPCS network.
       Hadoop helps to solve the world’s hardest problems.
       What you should know after this talk:
         A BN is a DAG and represents a joint probability distribution (JPD).
         Conditional probabilities can be computed by multiplying and summing the JPD.
         For large networks, this may be petabytes of intermediate data, but it’s MR.
  • 20. Questions? alexvk@{cloudera,gmail}.com