Bayesian Counters
aka In Memory Data Mining for Large DataSets

Alex Kozlov, Ph.D., Principal Solutions Architect, Cloudera Inc.
@alexvk2009 (Twitter)
June 13th, 2012
  
My past (aka about me)
  
Agenda
•  Current trends (large data, real time, uncertainty)
•  What are Bayesian Counters
•  Naïve Bayes
•  NN
•  Clique ranking
•  Association Rules
•  Some performance results
•  Conclusions
  

©2012 Cloudera, Inc. All Rights Reserved.
  
About Cloudera
Entering its 5th year
  
	
  
Cloudera’s mission is to help organizations profit from all of their data.
  
                                                                        	
  
Cloudera helps organizations profit from all of their data. We deliver the industry-standard platform which consolidates, stores and processes any kind of data, from any source, at scale. We make it possible to do more powerful analysis of more kinds of data, at scale, than ever before. With Cloudera, you get better insight into your customers, partners, vendors and businesses.
  
	
  




  
A Distributed System

Centralized
•  SPoF
•  Strict synchronization/Locking
•  Better Resource Management

Distributed
•  Availability
•  Redundancy/Fault Tolerance
•  Flexible
•  Interactive
  
Data collection
  
State space explosion
•  Chess alpha-beta tree has 10^45 nodes
•  We can solve only a 10^18 state space
•  Go has 10^360 nodes
•  Given Moore’s law, we’ll be there only by 2120

Can we help?
Uncertainty rules the world!
Or use distributed systems
  
More zeros
•  Most powerful computer (2019): 10^24 ops/sec
•  Seconds in a year: 3 x 10^7 seconds
•  Sun’s expected life: 10^7 years

We can probably be done with chess!
  
Time

Examples
•  Advertising: if you don’t figure out what the user wants in 5 minutes, you have lost them
•  Intrusion detection: the damage may be significantly bigger a few minutes after the break-in
•  Missing/misconfigured pages

[Figure: Value vs time; series: Value, Precision; x-axis 0 to 9]

http://cetas.net
http://www.woopra.com
http://www.wibidata.com/
  	
  
What we’ve learned so far
•  There is a lot of data out there
•  The storage capacity of distributed systems today is overwhelming
•  We need to admit that some problems will never be solved
•  Time is a critical factor
  
Why (not) to Mine from HD?
•  L1 Cache: 64 bits per CPU clock cycle (10^-9 sec), 10^10 bytes per second, latency in ns
•  HD – 12 x 100 x 10^6 bytes per second, latency in ms
•  Network – 10 GbE switches (depends on distance, topology)
•  East-West coast latency 20-40 ms (ms within a datacenter)

•  Move computation to the data: but ML wants all your data!
•  And sorted…

What if it does not fit in RAM?
•  Work on reasonable subsets
  
Push computations to the source
•  Collect relevant information at the source (pairwise correlations, can be done in parallel using HBase)

Compare:
      -> computations to data = MapReduce
      -> data to computations = map side join
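Collecting pairwise counts at the source can be sketched in Python as follows. This is a minimal illustration, assuming records arrive as dictionaries of variable to value; the function name `pairwise_counts` and the sample records are hypothetical, and an in-memory `Counter` stands in for the HBase table.

```python
from collections import Counter
from itertools import combinations


def pairwise_counts(records):
    """Count every single var=value item and every pair of items."""
    counts = Counter()
    for record in records:
        # Sort the items so the same pair always produces the same key.
        items = sorted(f"{var}={val}" for var, val in record.items())
        for item in items:
            counts[(item,)] += 1
        for pair in combinations(items, 2):
            counts[pair] += 1
    return counts


# Hypothetical sample records, for illustration only.
records = [
    {"A": "a1", "B": "b1"},
    {"A": "a1", "B": "b2"},
    {"A": "a2", "B": "b1"},
]
counts = pairwise_counts(records)
print(counts[("A=a1",)])
print(counts[("A=a1", "B=b1")])
```

Each record is processed independently of the others, so the loop could be split across workers and the partial counters added together afterwards.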
  
Bayesian Counters

Pr(A|B) = Pr(AB)/Pr(B) = Count(AB)/Count(B)

•  [A=a1;B=b1] -> 5
•  [A=a1;B=b2] -> 15
•  …
•  [A=a2;B=b1] -> 3
•  …
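As a small worked example of the ratio Count(AB)/Count(B), the sketch below uses only the three counters listed on the slide. The elided ("…") entries are left out, so the number it prints reflects those three counters alone; the helper names `count` and `conditional` are hypothetical.

```python
# Counters copied from the slide; the elided entries are simply omitted,
# so the result only reflects the three listed counters.
counters = {
    ("A=a1", "B=b1"): 5,
    ("A=a1", "B=b2"): 15,
    ("A=a2", "B=b1"): 3,
}


def count(counters, **fixed):
    """Sum the counters whose key contains every requested var=value item."""
    wanted = {f"{var}={val}" for var, val in fixed.items()}
    return sum(n for key, n in counters.items() if wanted <= set(key))


def conditional(counters, a, b):
    """Pr(A=a | B=b) = Count(A=a, B=b) / Count(B=b)."""
    denom = count(counters, B=b)
    return count(counters, A=a, B=b) / denom if denom else 0.0


print(conditional(counters, "a1", "b1"))  # 5 / (5 + 3) = 0.625
```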
  

                                    	
  
Time
What if we want to access more recent data more often?

•  Key: subset of variables with their values + timestamp (variable length)
•  Value: count (8 bytes)

index: | Key 1 | Value | Key 2 | Value | Key 3 | Value | Key 4 | Value |

Column families are different HFiles (30 min, 2 hours, 24 hours, 5 days, etc.)

Pr(A|B, last 20 minutes)
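A time-windowed query such as Pr(A|B, last 20 minutes) might be answered along the following lines. This is only a sketch under stated assumptions: the in-memory `events` dictionary stands in for the column families, the rule of reading the smallest bucket that covers the window is an assumption rather than something the slides spell out, and all names and sample increments are hypothetical.

```python
# Bucket lengths in seconds, taken from the slide: 30 min, 2 hours, 24 hours, 5 days.
BUCKETS = [30 * 60, 2 * 3600, 24 * 3600, 5 * 24 * 3600]


def pick_bucket(window_seconds):
    """Return the smallest bucket that fully covers the requested window."""
    for length in BUCKETS:
        if window_seconds <= length:
            return length
    raise ValueError("window is longer than the largest bucket")


def windowed_count(events, key, now, window_seconds):
    """Sum the increments for `key` recorded within the last `window_seconds`.

    `events` maps bucket length -> list of (key, timestamp, increment); this
    in-memory layout is an assumption standing in for the column families.
    """
    bucket = pick_bucket(window_seconds)
    return sum(
        inc
        for k, ts, inc in events.get(bucket, [])
        if k == key and now - window_seconds <= ts <= now
    )


# Hypothetical increments stored in the 30-minute bucket.
now = 1321039000
events = {
    30 * 60: [
        ("A=a1;B=b1", now - 100, 2),
        ("A=a1;B=b1", now - 1500, 4),  # older than 20 minutes, ignored
        ("B=b1", now - 100, 3),
        ("B=b1", now - 600, 1),
    ]
}
window = 20 * 60
pr = windowed_count(events, "A=a1;B=b1", now, window) / windowed_count(events, "B=b1", now, window)
print(pr)  # 2 / 4 = 0.5
```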
  	
  
Anatomy of a counter
•  Region (divide between)
•  Counter/Table: Iris, Cars, …
•  Column family (File): 30 mins, 2 hours
•  Column qualifier: [sepal_width=2;class=0]
•  Version: 1321038671, 1321038998
•  Value (data): 15
  
File/Memory Structure
  
HBase schema design
•  Push computations into distributed realm
•  Column family for data locality
•  Key is a tuple of var=value combinations
•  No random salt
•  Value is a counter (8 bytes)
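The key and value layout described above could be encoded roughly as follows. This is a sketch, not the author's implementation: sorting the variables by name and using a big-endian byte order are assumptions made here for illustration, and the function names are hypothetical.

```python
import struct


def encode_key(assignment):
    """Build a key such as b'[class=0;sepal_width=2]' from a dict.

    Sorting by variable name is an assumption made here so that the same
    combination always yields the same key.
    """
    body = ";".join(f"{var}={assignment[var]}" for var in sorted(assignment))
    return f"[{body}]".encode("utf-8")


def encode_count(n):
    """Pack a counter as an 8-byte big-endian signed integer."""
    return struct.pack(">q", n)


def decode_count(raw):
    """Unpack an 8-byte big-endian signed integer."""
    return struct.unpack(">q", raw)[0]


key = encode_key({"sepal_width": 2, "class": 0})
value = encode_count(15)
print(key, len(value), decode_count(value))
```

A deterministic key with no random salt keeps related combinations next to each other in sort order, which is what makes range scans over a variable subset possible.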
  
Implementations
•  Naïve Bayes
•  Nearest Neighbor
•  Association rules
•  Clique ranking
  
Naïve Bayes

Pr(C|F_1, F_2, ..., F_N) = (1/z) Pr(C) Π_i Pr(F_i|C)

Requires only pairwise counters (complexity N^2)

*Linear if we fix the target node
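The formula can be evaluated directly from single and pairwise counters, as in the sketch below. The counter layout (tuples of 'var=value' strings), the function name `naive_bayes`, and the sample counts are all hypothetical; no smoothing is applied, which a practical classifier would probably want.

```python
def naive_bayes(counts, target, evidence):
    """Return Pr(target value | evidence) using only pairwise counters.

    `counts` holds single counters keyed by (item,) and pairwise counters keyed
    by a sorted (item, item) tuple, where an item is a 'var=value' string.
    No smoothing is applied, so an unseen pair gives a zero score.
    """
    prefix = target + "="
    classes = {
        k[0]: n for k, n in counts.items() if len(k) == 1 and k[0].startswith(prefix)
    }
    total = sum(classes.values())
    scores = {}
    for cls, n_cls in classes.items():
        score = n_cls / total  # Pr(C)
        for feature in evidence:
            pair = tuple(sorted((cls, feature)))
            score *= counts.get(pair, 0) / n_cls  # Pr(F_i | C)
        scores[cls] = score
    z = sum(scores.values())  # normalizing constant
    return {cls: (s / z if z else 0.0) for cls, s in scores.items()}


# Hypothetical counters, for illustration only.
counts = {
    ("class=0",): 4,
    ("class=1",): 6,
    ("class=0", "f=x"): 3,
    ("class=1", "f=x"): 3,
    ("class=0", "g=y"): 2,
    ("class=1", "g=y"): 6,
}
posterior = naive_bayes(counts, "class", ["f=x", "g=y"])
print(posterior)
```

Only counters that pair the target with one feature are read, which is why the cost is linear in the number of features once the target is fixed.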
  
k-NN

P(C) for k nearest neighbors

count(C|X) = Σ_i count(C|X_i)

where X_1, X_2, ..., X_N are in the vicinity of X
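The sum over neighboring counters can be sketched as below. Note the assumptions: "vicinity" is taken to be a fixed radius around a single numeric feature, which stands in for a true k-nearest selection, and the function names and sample counters are hypothetical.

```python
from collections import Counter


def neighborhood_class_counts(counters, x, radius):
    """count(C|X) = sum over X_i near X of count(C|X_i)."""
    totals = Counter()
    for (x_i, cls), n in counters.items():
        if abs(x_i - x) <= radius:
            totals[cls] += n
    return totals


def class_probabilities(totals):
    """Normalize the summed counts into P(C)."""
    total = sum(totals.values())
    return {cls: n / total for cls, n in totals.items()} if total else {}


# Hypothetical counters keyed by (feature value, class).
counters = {
    (1.3, 0): 4,
    (1.4, 0): 6,
    (1.5, 1): 2,
    (4.7, 1): 9,
}
totals = neighborhood_class_counts(counters, x=1.4, radius=0.15)
print(class_probabilities(totals))
```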
  
Clique ranking
What is the best structure of a Bayesian Network?

I(X;Y) = Σ_x Σ_y p(x,y) log[p(x,y)/(p(x)p(y))]

where x in X and y in Y

Using random projection, this can be generalized to an abstract subset Z
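Mutual information is computable from pairwise counters alone, since p(x,y), p(x) and p(y) are all ratios of counts. A minimal sketch follows, assuming the joint counts are given as a dictionary keyed by (x, y); the function name and sample counts are hypothetical, and the result is in nats.

```python
import math


def mutual_information(joint_counts):
    """I(X;Y) in nats from a dict {(x, y): count}."""
    total = sum(joint_counts.values())
    px, py = {}, {}
    for (x, y), n in joint_counts.items():
        px[x] = px.get(x, 0) + n
        py[y] = py.get(y, 0) + n
    mi = 0.0
    for (x, y), n in joint_counts.items():
        if n == 0:
            continue  # a zero cell contributes nothing to the sum
        pxy = n / total
        mi += pxy * math.log(pxy / ((px[x] / total) * (py[y] / total)))
    return mi


# Hypothetical counts: independent variables, then fully dependent ones.
independent = {(0, 0): 1, (0, 1): 1, (1, 0): 1, (1, 1): 1}
dependent = {(0, 0): 5, (1, 1): 5}
print(mutual_information(independent))
print(mutual_information(dependent))
```

Ranking candidate edges by this score is one way to decide which variable pairs are worth linking in the network.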
  
Association rules
•  Confidence (A -> B): count(A and B)/count(A)
•  Lift (A -> B): N x count(A and B)/[count(A) x count(B)], where N is the total number of transactions
•  Usually filtered on support: count(A and B)
•  Frequent itemset search
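The rule measures above reduce to a few counter lookups. The sketch below counts directly over a small hypothetical transaction list instead of reading stored counters, and it scales lift by the number of transactions N so that a value of 1.0 corresponds to independence; all names and data are illustrative assumptions.

```python
def support(transactions, itemset):
    """count(itemset): number of transactions containing every item."""
    items = set(itemset)
    return sum(1 for t in transactions if items <= set(t))


def confidence(transactions, a, b):
    """count(A and B) / count(A)."""
    count_a = support(transactions, a)
    return support(transactions, set(a) | set(b)) / count_a if count_a else 0.0


def lift(transactions, a, b):
    """N * count(A and B) / (count(A) * count(B))."""
    count_a = support(transactions, a)
    count_b = support(transactions, b)
    if not count_a or not count_b:
        return 0.0
    count_ab = support(transactions, set(a) | set(b))
    return len(transactions) * count_ab / (count_a * count_b)


# Hypothetical transactions, for illustration only.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
print(support(transactions, {"bread", "milk"}))
print(confidence(transactions, {"bread"}, {"milk"}))
print(lift(transactions, {"bread"}, {"milk"}))
```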
  
Performance
retail.dat – 88K transactions over 14,246 items
•  Mahout FPGrowth – 0.5 sec per pattern (58,623 patterns with min support 2)
•  < 1 ms per pattern on a 5 node cluster
  
FPGrowth performance

Row   Support   Rules     Time (ms)
1     1         69,309    25,659,052
2     2         58,623    23,103,547
3     4         48,270    20,782,325
4     8         38,661    17,643,592
5     16        28,988    13,994,334
6     32        19,939     9,714,935
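Dividing total time by the number of rules gives the per-rule cost for each row of the table. The short calculation below uses only the figures in the table; the variable names are hypothetical. Every row works out to a few hundred milliseconds per rule, in line with the 0.5 sec per pattern quoted for Mahout FPGrowth.

```python
# (min support, rules, total time in ms), copied from the table above.
rows = [
    (1, 69309, 25659052),
    (2, 58623, 23103547),
    (4, 48270, 20782325),
    (8, 38661, 17643592),
    (16, 28988, 13994334),
    (32, 19939, 9714935),
]

per_rule_ms = {support: ms / rules for support, rules, ms in rows}
for support, value in per_rule_ms.items():
    print(f"min support {support}: {value:.0f} ms per rule")
```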
  	
  
  
Time

   nb iris class=2 sepal_length=5;petal_length=1.4 300

•  Target Variable: class=2
•  Predictors: sepal_length=5;petal_length=1.4
•  Time (seconds from now): 300
  
Conclusions
•  Storing n-wise counts is a powerful data analysis paradigm
•  We can implement a number of powerful algorithms on top of counters
•  A system that will know about the world more than you would ever dare to admit
  
Future Directions
•  Direct extensions:
     –  Dynamic adjustment of counters to collect
     –  Dynamic adjustment to time buckets
     –  Optimization
•  Testing problems:
     –  Cannot directly compare to static algos
•  More general:
     –  Better data management tools for machine learning
  



  
Thank you!
  	
  




  
Questions?
freenode: #cloudera / #hadoop
http://www.cloudera.com
Do not hesitate to email alexvk@{gmail,cloudera}.com
  
  

Weitere ähnliche Inhalte

Was ist angesagt?

Intro to Big Data using Hadoop
Intro to Big Data using Hadoop Intro to Big Data using Hadoop
Intro to Big Data using Hadoop Sergejus Barinovas
 
Distributed batch processing with Hadoop
Distributed batch processing with HadoopDistributed batch processing with Hadoop
Distributed batch processing with HadoopFerran Galí Reniu
 
Distributed Multi-GPU Computing with Dask, CuPy and RAPIDS
Distributed Multi-GPU Computing with Dask, CuPy and RAPIDSDistributed Multi-GPU Computing with Dask, CuPy and RAPIDS
Distributed Multi-GPU Computing with Dask, CuPy and RAPIDSPeterAndreasEntschev
 
Realtime Per Face Texture Mapping (PTEX)
Realtime Per Face Texture Mapping (PTEX)Realtime Per Face Texture Mapping (PTEX)
Realtime Per Face Texture Mapping (PTEX)basisspace
 
Introduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduceIntroduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduceDr Ganesh Iyer
 
Introduction to Deep Learning with Python
Introduction to Deep Learning with PythonIntroduction to Deep Learning with Python
Introduction to Deep Learning with Pythonindico data
 
Spark and Shark: Lightning-Fast Analytics over Hadoop and Hive Data
Spark and Shark: Lightning-Fast Analytics over Hadoop and Hive DataSpark and Shark: Lightning-Fast Analytics over Hadoop and Hive Data
Spark and Shark: Lightning-Fast Analytics over Hadoop and Hive DataJetlore
 
[243] turning data into value
[243] turning data into value[243] turning data into value
[243] turning data into valueNAVER D2
 
Intro to TensorFlow and PyTorch Workshop at Tubular Labs
Intro to TensorFlow and PyTorch Workshop at Tubular LabsIntro to TensorFlow and PyTorch Workshop at Tubular Labs
Intro to TensorFlow and PyTorch Workshop at Tubular LabsKendall
 
Scaling Deep Learning with MXNet
Scaling Deep Learning with MXNetScaling Deep Learning with MXNet
Scaling Deep Learning with MXNetAI Frontiers
 
BDAS Shark study report 03 v1.1
BDAS Shark study report  03 v1.1BDAS Shark study report  03 v1.1
BDAS Shark study report 03 v1.1Stefanie Zhao
 
GoodFit: Multi-Resource Packing of Tasks with Dependencies
GoodFit: Multi-Resource Packing of Tasks with DependenciesGoodFit: Multi-Resource Packing of Tasks with Dependencies
GoodFit: Multi-Resource Packing of Tasks with DependenciesDataWorks Summit/Hadoop Summit
 
Modeling with Hadoop kdd2011
Modeling with Hadoop kdd2011Modeling with Hadoop kdd2011
Modeling with Hadoop kdd2011Milind Bhandarkar
 
Nearest Neighbor Customer Insight
Nearest Neighbor Customer InsightNearest Neighbor Customer Insight
Nearest Neighbor Customer InsightMapR Technologies
 
How to win data science competitions with Deep Learning
How to win data science competitions with Deep LearningHow to win data science competitions with Deep Learning
How to win data science competitions with Deep LearningSri Ambati
 
Scaling Python to CPUs and GPUs
Scaling Python to CPUs and GPUsScaling Python to CPUs and GPUs
Scaling Python to CPUs and GPUsTravis Oliphant
 

Was ist angesagt? (20)

Intro to Big Data using Hadoop
Intro to Big Data using Hadoop Intro to Big Data using Hadoop
Intro to Big Data using Hadoop
 
Distributed batch processing with Hadoop
Distributed batch processing with HadoopDistributed batch processing with Hadoop
Distributed batch processing with Hadoop
 
Distributed Multi-GPU Computing with Dask, CuPy and RAPIDS
Distributed Multi-GPU Computing with Dask, CuPy and RAPIDSDistributed Multi-GPU Computing with Dask, CuPy and RAPIDS
Distributed Multi-GPU Computing with Dask, CuPy and RAPIDS
 
Realtime Per Face Texture Mapping (PTEX)
Realtime Per Face Texture Mapping (PTEX)Realtime Per Face Texture Mapping (PTEX)
Realtime Per Face Texture Mapping (PTEX)
 
Introduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduceIntroduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduce
 
Introduction to Deep Learning with Python
Introduction to Deep Learning with PythonIntroduction to Deep Learning with Python
Introduction to Deep Learning with Python
 
Zaharia spark-scala-days-2012
Zaharia spark-scala-days-2012Zaharia spark-scala-days-2012
Zaharia spark-scala-days-2012
 
Spark and Shark: Lightning-Fast Analytics over Hadoop and Hive Data
Spark and Shark: Lightning-Fast Analytics over Hadoop and Hive DataSpark and Shark: Lightning-Fast Analytics over Hadoop and Hive Data
Spark and Shark: Lightning-Fast Analytics over Hadoop and Hive Data
 
[243] turning data into value
[243] turning data into value[243] turning data into value
[243] turning data into value
 
ACM 2013-02-25
ACM 2013-02-25ACM 2013-02-25
ACM 2013-02-25
 
Intro to TensorFlow and PyTorch Workshop at Tubular Labs
Intro to TensorFlow and PyTorch Workshop at Tubular LabsIntro to TensorFlow and PyTorch Workshop at Tubular Labs
Intro to TensorFlow and PyTorch Workshop at Tubular Labs
 
Scaling Deep Learning with MXNet
Scaling Deep Learning with MXNetScaling Deep Learning with MXNet
Scaling Deep Learning with MXNet
 
lec6_ref.pdf
lec6_ref.pdflec6_ref.pdf
lec6_ref.pdf
 
BDAS Shark study report 03 v1.1
BDAS Shark study report  03 v1.1BDAS Shark study report  03 v1.1
BDAS Shark study report 03 v1.1
 
GoodFit: Multi-Resource Packing of Tasks with Dependencies
GoodFit: Multi-Resource Packing of Tasks with DependenciesGoodFit: Multi-Resource Packing of Tasks with Dependencies
GoodFit: Multi-Resource Packing of Tasks with Dependencies
 
Modeling with Hadoop kdd2011
Modeling with Hadoop kdd2011Modeling with Hadoop kdd2011
Modeling with Hadoop kdd2011
 
Nearest Neighbor Customer Insight
Nearest Neighbor Customer InsightNearest Neighbor Customer Insight
Nearest Neighbor Customer Insight
 
How to win data science competitions with Deep Learning
How to win data science competitions with Deep LearningHow to win data science competitions with Deep Learning
How to win data science competitions with Deep Learning
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
Scaling Python to CPUs and GPUs
Scaling Python to CPUs and GPUsScaling Python to CPUs and GPUs
Scaling Python to CPUs and GPUs
 

Ähnlich wie Bayesian Counters

Fixing twitter
Fixing twitterFixing twitter
Fixing twitterRoger Xia
 
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...
Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...smallerror
 
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...
Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...xlight
 
Modern software design in Big data era
Modern software design in Big data eraModern software design in Big data era
Modern software design in Big data eraBill GU
 
Lessons from lhc
Lessons from lhcLessons from lhc
Lessons from lhcdrsm79
 
7. Key-Value Databases: In Depth
7. Key-Value Databases: In Depth7. Key-Value Databases: In Depth
7. Key-Value Databases: In DepthFabio Fumarola
 
Fast and Scalable Python
Fast and Scalable PythonFast and Scalable Python
Fast and Scalable PythonTravis Oliphant
 
John adams talk cloudy
John adams   talk cloudyJohn adams   talk cloudy
John adams talk cloudyJohn Adams
 
Storm presentation
Storm presentationStorm presentation
Storm presentationShyam Raj
 
A Production Quality Sketching Library for the Analysis of Big Data
A Production Quality Sketching Library for the Analysis of Big DataA Production Quality Sketching Library for the Analysis of Big Data
A Production Quality Sketching Library for the Analysis of Big DataDatabricks
 
Tsinghua University: Two Exemplary Applications in China
Tsinghua University: Two Exemplary Applications in ChinaTsinghua University: Two Exemplary Applications in China
Tsinghua University: Two Exemplary Applications in ChinaDataStax Academy
 
Cassandra background-and-architecture
Cassandra background-and-architectureCassandra background-and-architecture
Cassandra background-and-architectureMarkus Klems
 
Big data and hadoop
Big data and hadoopBig data and hadoop
Big data and hadoopMohit Tare
 
Re-Engineering PostgreSQL as a Time-Series Database
Re-Engineering PostgreSQL as a Time-Series DatabaseRe-Engineering PostgreSQL as a Time-Series Database
Re-Engineering PostgreSQL as a Time-Series DatabaseAll Things Open
 
Introduce Apache Cassandra - JavaTwo Taiwan, 2012
Introduce Apache Cassandra - JavaTwo Taiwan, 2012Introduce Apache Cassandra - JavaTwo Taiwan, 2012
Introduce Apache Cassandra - JavaTwo Taiwan, 2012Boris Yen
 
Big Data Analytics Strategy and Roadmap
Big Data Analytics Strategy and RoadmapBig Data Analytics Strategy and Roadmap
Big Data Analytics Strategy and RoadmapSrinath Perera
 

Ähnlich wie Bayesian Counters (20)

Fixing twitter
Fixing twitterFixing twitter
Fixing twitter
 
Fixing_Twitter
Fixing_TwitterFixing_Twitter
Fixing_Twitter
 
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...
Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...
 
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...
Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...
 
Modern software design in Big data era
Modern software design in Big data eraModern software design in Big data era
Modern software design in Big data era
 
Lessons from lhc
Lessons from lhcLessons from lhc
Lessons from lhc
 
7. Key-Value Databases: In Depth
7. Key-Value Databases: In Depth7. Key-Value Databases: In Depth
7. Key-Value Databases: In Depth
 
Fast and Scalable Python
Fast and Scalable PythonFast and Scalable Python
Fast and Scalable Python
 
L6.sp17.pptx
L6.sp17.pptxL6.sp17.pptx
L6.sp17.pptx
 
ENAR short course
ENAR short courseENAR short course
ENAR short course
 
John adams talk cloudy
John adams   talk cloudyJohn adams   talk cloudy
John adams talk cloudy
 
Storm presentation
Storm presentationStorm presentation
Storm presentation
 
A Production Quality Sketching Library for the Analysis of Big Data
A Production Quality Sketching Library for the Analysis of Big DataA Production Quality Sketching Library for the Analysis of Big Data
A Production Quality Sketching Library for the Analysis of Big Data
 
Tsinghua University: Two Exemplary Applications in China
Tsinghua University: Two Exemplary Applications in ChinaTsinghua University: Two Exemplary Applications in China
Tsinghua University: Two Exemplary Applications in China
 
Cassandra background-and-architecture
Cassandra background-and-architectureCassandra background-and-architecture
Cassandra background-and-architecture
 
Big data and hadoop
Big data and hadoopBig data and hadoop
Big data and hadoop
 
Re-Engineering PostgreSQL as a Time-Series Database
Re-Engineering PostgreSQL as a Time-Series DatabaseRe-Engineering PostgreSQL as a Time-Series Database
Re-Engineering PostgreSQL as a Time-Series Database
 
Introduce Apache Cassandra - JavaTwo Taiwan, 2012
Introduce Apache Cassandra - JavaTwo Taiwan, 2012Introduce Apache Cassandra - JavaTwo Taiwan, 2012
Introduce Apache Cassandra - JavaTwo Taiwan, 2012
 
Cassandra training
Cassandra trainingCassandra training
Cassandra training
 
Big Data Analytics Strategy and Roadmap
Big Data Analytics Strategy and RoadmapBig Data Analytics Strategy and Roadmap
Big Data Analytics Strategy and Roadmap
 

Mehr von DataWorks Summit

Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisDataWorks Summit
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiDataWorks Summit
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...DataWorks Summit
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...DataWorks Summit
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal SystemDataWorks Summit
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExampleDataWorks Summit
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberDataWorks Summit
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixDataWorks Summit
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiDataWorks Summit
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsDataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureDataWorks Summit
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EngineDataWorks Summit
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudDataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiDataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerDataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouDataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkDataWorks Summit
 

Mehr von DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System



Bayesian Counters

  • 1. Bayesian Counters, aka In Memory Data Mining for Large DataSets
      Alex Kozlov, Ph.D., Principal Solutions Architect, Cloudera Inc.
      @alexvk2009 (Twitter)
      June 13th, 2012
  • 2.
  • 3. My past (aka about me)
  • 4. Agenda
      • Current trends (large data, real time, uncertainty)
      • What are Bayesian Counters
      • Naïve Bayes
      • Nearest Neighbor (NN)
      • Clique ranking
      • Association Rules
      • Some performance results
      • Conclusions
      ©2012 Cloudera, Inc. All Rights Reserved.
  • 5. About Cloudera
      Entering its 5th year
      Cloudera's mission is to help organizations profit from all of their data.
      We deliver the industry-standard platform which consolidates, stores and processes any kind of data, from any source, at scale. We make it possible to do more powerful analysis of more kinds of data, at scale, than ever before. With Cloudera, you get better insight into your customers, partners, vendors and businesses.
  • 6. A Distributed System
      Centralized                          Distributed
      • SPoF                               • Availability
      • Strict synchronization/Locking     • Redundancy/Fault Tolerance
      • Better Resource Management         • Flexible
      • Interactive
  • 8. State space explosion
      • Chess alpha-beta tree has 10^45 nodes
      • We can solve only a 10^18 state space
      • Go has 10^360 nodes
      • Given Moore's law we'll be there only by 2120
      Can we help?
      Uncertainty rules the world!
      Or use distributed systems
  • 9. More zeros
      • Most powerful computer (2019): 10^24 ops/sec
      • Seconds in a year: 3 x 10^7 seconds
      • Sun's expected life: 10^7 years
      We can probably be done with chess!
  • 10. Time
      Examples
      • Advertising: if you don't figure out what the user wants in 5 minutes, you've lost him
      • Intrusion detection: the damage may be significantly bigger a few minutes after a break-in
      • Missing/misconfigured pages
      [Chart: Value vs time; series: Value, Precision]
      http://cetas.net
      http://www.woopra.com
      http://www.wibidata.com/
  • 11. What we've learned so far
      • There is a lot of data out there
      • The storage capacity of distributed systems today is overwhelming
      • We need to admit that some problems will never be solved
      • Time is a critical factor
  • 12. Why (not) to Mine from HD?
      • L1 Cache: 64 bits per CPU clock cycle (10^-9 sec), 10^10 bytes per second, latency in ns
      • HD: 12 x 100 x 10^6 bytes per second, latency in ms
      • Network: 10 GbE switches (depends on distance, topology)
      • East-West coast latency 20-40 ms (ms within a datacenter)
      • Move computation to the data: but ML wants all your data!
      • And sorted…
      What if it does not fit in RAM?
      • Work on reasonable subsets
  • 13. Push computations to the source
      • Collect relevant information at the source (pairwise correlations, can be done in parallel using HBase)
      Compare:
      -> computations to data = MapReduce
      -> data to computations = map side join
  • 14. Bayesian Counters
      • [A=a1;B=b1] -> 5
      • [A=a1;B=b2] -> 15
      • …
      • [A=a2;B=b1] -> 3
      • …
      Pr(A|B) = Pr(AB)/Pr(B) = Count(AB)/Count(B)
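The identity Pr(A|B) = Count(AB)/Count(B) can be illustrated with a minimal Python sketch. It uses only the three counts visible on the slide; the elided entries are left out, so the resulting number is illustrative only. The `count` and `conditional` helpers and the in-memory `Counter` are assumptions made for this example, not the author's implementation.

```python
from collections import Counter

# Joint counters keyed by a frozenset of (variable, value) pairs.
# Only the three counts visible on the slide are used; elided ones are left out.
counts = Counter({
    frozenset({("A", "a1"), ("B", "b1")}): 5,
    frozenset({("A", "a1"), ("B", "b2")}): 15,
    frozenset({("A", "a2"), ("B", "b1")}): 3,
})

def count(**assignment):
    """Sum the joint counters that are consistent with a partial assignment."""
    wanted = set(assignment.items())
    return sum(n for key, n in counts.items() if wanted <= key)

def conditional(target, given):
    """Pr(target | given) = Count(target, given) / Count(given)."""
    denom = count(**given)
    if denom == 0:
        raise ValueError("no observations for the conditioning event")
    return count(**target, **given) / denom

# Pr(A=a1 | B=b1) from the three counters above.
p = conditional({"A": "a1"}, {"B": "b1"})
print(p)
```

Marginal counts such as Count(B) are obtained here by summing joint counters; a store that also keeps single-variable counters could read them directly instead.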
  • 15. Time
      What if we want to access more recent data more often?
      • Key: subset of variables with their values + timestamp (variable length)
      • Value: count (8 bytes)
      [Diagram: index followed by Key 1/Value, Key 2/Value, Key 3/Value, Key 4/Value]
      Column families are different HFiles (30 min, 2 hours, 24 hours, 5 days, etc.)
      Pr(A|B, last 20 minutes)
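One possible way to serialize such a key/value pair is sketched below in Python. The slide only states that the key is a variable-length combination of variable values plus a timestamp and that the value is an 8-byte count; the `@` separator, the sorted variable order, and the helper names are assumptions made for illustration. The 8-byte big-endian integer mirrors how a Java long is usually serialized.

```python
import struct

def encode_key(assignment, timestamp):
    """Serialize var=value pairs plus a timestamp into a variable-length key.

    Variables are sorted so the same subset always yields the same key
    (an assumption of this sketch, as is the '@' separator).
    """
    pairs = ";".join(f"{var}={val}" for var, val in sorted(assignment.items()))
    return f"[{pairs}]@{timestamp}".encode("utf-8")

def encode_count(n):
    """Pack a counter as an 8-byte big-endian signed integer."""
    return struct.pack(">q", n)

def decode_count(raw):
    """Inverse of encode_count."""
    return struct.unpack(">q", raw)[0]

key = encode_key({"sepal_width": 2, "class": 0}, 1321038671)
value = encode_count(15)
print(key, len(value), decode_count(value))
```

A fixed-width value keeps increments cheap, while the variable-length key lets any subset of variables share one table.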
  • 16. Anatomy of a counter
      [Diagram of how a counter is laid out in HBase]
      • Region (divide between)
      • Counter/Table: Iris, Cars, …
      • Column family (File): 30 mins, 2 hours
      • Column qualifier: [sepal_width=2;class=0]
      • Version: 1321038671, 1321038998
      • Value (data): 15
  • 18. HBase schema design
      • Push computations into distributed realm
      • Column family for data locality
      • Key is a tuple of var=value combinations
      • No random salt
      • Value is a counter (8 bytes)
  • 19. Implementations
      • Naïve Bayes
      • Nearest Neighbor
      • Association rules
      • Clique ranking
  • 20. Naïve Bayes
      Pr(C|F1, F2, ..., FN) = 1/z Pr(C) Πi Pr(Fi|C)
      Requires only pairwise counters (complexity N^2)
      *Linear if we fix the target node
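A minimal Python sketch of this formula, computed purely from class counters and (class, feature) pairwise counters, might look as follows. The toy observations and all names are made up for illustration, and no smoothing is applied, so an unseen (class, feature value) pair drives that class's score to zero.

```python
from collections import Counter

# Toy observations (made up for illustration): (class, {feature: value}).
rows = [
    ("c1", {"f1": "yes", "f2": "yes"}),
    ("c1", {"f1": "yes", "f2": "no"}),
    ("c2", {"f1": "no", "f2": "no"}),
    ("c2", {"f1": "yes", "f2": "no"}),
]

class_counts = Counter()  # Count(C)
pair_counts = Counter()   # Count(C, Fi=v): the pairwise counters
for label, features in rows:
    class_counts[label] += 1
    for name, value in features.items():
        pair_counts[(label, name, value)] += 1

def posterior(features):
    """Pr(C|F1..FN) = 1/z * Pr(C) * prod_i Pr(Fi|C), using counts only."""
    total = sum(class_counts.values())
    scores = {}
    for label, n_c in class_counts.items():
        score = n_c / total
        for name, value in features.items():
            score *= pair_counts[(label, name, value)] / n_c
        scores[label] = score
    z = sum(scores.values())
    if z == 0:
        raise ValueError("no class is consistent with the observed features")
    return {label: s / z for label, s in scores.items()}

result = posterior({"f1": "yes", "f2": "yes"})
print(result)
```

With the target variable fixed, only counters that pair the class with one feature are needed, which is the linear case mentioned on the slide.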
  • 21. k-NN
      P(C) for k nearest neighbors
      count(C|X) = Σi count(C|Xi)
      where X1, X2, ..., XN are in the vicinity of X
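The summation over neighboring counters can be sketched in Python as below. The counters, the single numeric feature, and the choice of "vicinity" as the k closest stored values are all assumptions of this example; the slide does not say how the neighborhood is defined.

```python
from collections import Counter

# Hypothetical counters: (feature value, class) -> count. Values are made up.
counters = {
    (1.3, 0): 4, (1.4, 0): 9, (1.5, 0): 7,
    (1.5, 1): 1, (4.5, 1): 8, (4.7, 1): 5,
}

def knn_class_counts(x, k):
    """count(C|X) = sum of count(C|Xi) over the k stored values Xi closest to x."""
    values = sorted({value for value, _ in counters},
                    key=lambda v: (abs(v - x), v))
    nearest = set(values[:k])
    totals = Counter()
    for (value, label), n in counters.items():
        if value in nearest:
            totals[label] += n
    return totals

totals = knn_class_counts(1.4, k=3)
probs = {label: n / sum(totals.values()) for label, n in totals.items()}
print(totals, probs)
```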
  • 22. Clique ranking
      What is the best structure of a Bayesian Network?
      I(X;Y) = Σx Σy p(x,y) log[p(x,y)/(p(x)p(y))]
      where x in X and y in Y
      Using random projection, this can be generalized to an abstract subset Z
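Mutual information can be computed directly from pairwise counts, since p(x,y)/(p(x)p(y)) reduces to n(x,y)·N/(n(x)·n(y)). The following Python sketch shows this under the assumption that the pairwise counters for X and Y are available as a dictionary; the function name and the two toy tables are illustrative only.

```python
import math
from collections import Counter

def mutual_information(pair_counts):
    """I(X;Y) = sum_{x,y} p(x,y) * log(p(x,y) / (p(x) p(y))), in nats."""
    total = sum(pair_counts.values())
    x_counts, y_counts = Counter(), Counter()
    for (x, y), n in pair_counts.items():
        x_counts[x] += n
        y_counts[y] += n
    mi = 0.0
    for (x, y), n in pair_counts.items():
        if n == 0:
            continue  # the term 0 * log(0) is taken as 0
        # p(x,y) / (p(x) p(y)) simplifies to n * total / (count(x) * count(y))
        mi += (n / total) * math.log(n * total / (x_counts[x] * y_counts[y]))
    return mi

# Independent variables give I = 0; perfectly dependent binary ones give log 2.
independent = mutual_information(
    {("a", "c"): 25, ("a", "d"): 25, ("b", "c"): 25, ("b", "d"): 25})
dependent = mutual_information({("a", "c"): 50, ("b", "d"): 50})
print(independent, dependent)
```

Ranking candidate edges or cliques by this score is one common way to pick structure; whether the talk used exactly this ranking is not stated on the slide.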
  • 23. Assoc
      • Confidence (A -> B): count(A and B)/count(A)
      • Lift (A -> B): N x count(A and B)/[count(A) x count(B)], where N is the total number of transactions
      • Usually filtered on support: count(A and B)
      • Frequent itemset search
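Support, confidence, and lift all follow from single-item and pairwise counters, as the Python sketch below illustrates. The transactions are made up, the helper names are hypothetical, and lift is computed in its usual probability-ratio form Pr(A and B)/(Pr(A) Pr(B)), which carries a factor of N (the number of transactions) when written in counts.

```python
from collections import Counter
from itertools import combinations

# Toy transactions (made up for illustration).
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

# Count every single item and every pair of items per transaction.
counts = Counter()
for basket in transactions:
    for size in (1, 2):
        for itemset in combinations(sorted(basket), size):
            counts[frozenset(itemset)] += 1

def support(*items):
    """count(A and B): number of transactions containing all given items."""
    return counts[frozenset(items)]

def confidence(a, b):
    """Confidence(A -> B) = count(A and B) / count(A)."""
    return support(a, b) / support(a)

def lift(a, b):
    """Lift(A -> B) = N * count(A and B) / (count(A) * count(B))."""
    n = len(transactions)
    return n * support(a, b) / (support(a) * support(b))

print(support("bread", "milk"), confidence("bread", "milk"), lift("bread", "milk"))
```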
  • 24. Performance
      retail.dat: 88K transactions over 14,246 items
      • Mahout FPGrowth: 0.5 sec per pattern (58,623 patterns with min support 2)
      • < 1 ms per pattern on a 5 node cluster
  • 25. FPGrowth performance
      Row   Support   Rules     Time (ms)
      1     1         69,309    25,659,052
      2     2         58,623    23,103,547
      3     4         48,270    20,782,325
      4     8         38,661    17,643,592
      5     16        28,988    13,994,334
      6     32        19,939    9,714,935
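As a quick sanity check, the per-pattern cost implied by the FPGrowth performance figures can be computed by dividing the total time by the number of rules in each row. The short Python sketch below does this with the numbers from the slide; the variable names are illustrative. The ratios come out at a few hundred milliseconds per pattern, the same order of magnitude as the roughly 0.5 sec per pattern quoted for Mahout FPGrowth.

```python
# (min support, rules, total time in ms), taken from the FPGrowth performance slide.
runs = [
    (1, 69_309, 25_659_052),
    (2, 58_623, 23_103_547),
    (4, 48_270, 20_782_325),
    (8, 38_661, 17_643_592),
    (16, 28_988, 13_994_334),
    (32, 19_939, 9_714_935),
]

# Average time per rule for each minimum-support setting.
per_pattern_ms = {support: total_ms / rules for support, rules, total_ms in runs}
for support, ms in per_pattern_ms.items():
    print(f"min support {support:>2}: {ms:.0f} ms per pattern")
```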
  • 28. Time
      nb iris class=2 sepal_length=5;petal_length=1.4 300
      • Target variable: class=2
      • Predictors: sepal_length=5;petal_length=1.4
      • Time (seconds from now): 300
  • 29. Conclusions
      • Storing n-wise counts is a powerful data analysis paradigm
      • We can implement a number of powerful algorithms on top of counters
      • A system that will know about the world more than you would ever dare to admit
  • 30. Future Directions
      • Direct extensions:
          – Dynamic adjustment of counters to collect
          – Dynamic adjustment to time buckets
          – Optimization
      • Testing problems:
          – Cannot directly compare to static algorithms
      • More general:
          – Better data management tools for machine learning
  • 31. Thank you!
  • 32. Questions?
      freenode: #cloudera / #hadoop
      http://www.cloudera.com
      Do not hesitate to email alexvk@{gmail,cloudera}.com