Bringing the excitement back to data analysis

MC Brown
VP, TechPubs and Education
  
In the year 1992….

 •  Freetext Database = Document/NoSQL Database
 •  Massive Datasets
     –  19043 records!!!
     –  Approx. 8k per record
  
The Drug

 •  Data Analysis was ‘Exciting’
 •  2-3 days to write the analysis program
 •  Processing would occur overnight
 •  Statistics required ‘whole set’ processing
  
The Hit

 •  Mornings were ‘the hit’
 •  The joy of real data analysis is the output of a good report
 •  Get good stats
     –  I know how many teachers teach Geography in Scotland!
     –  I know 400 people have purchased our History software!
 •  The wait and the results kept us working
  
In the year 2002

 •  Grid computing was the drug
 •  Building 200-2000 node grid systems
 •  Analysis could happen the same day
 •  Datasets could be huge
     –  They just took more hours
 •  Still working on entire datasets
     –  Statistics still required whole set process
 •  Jobs became monotonous
 •  More about construction and technology than stats
  
In the year 2012

 •  Need info and statistics quicker than ever
 •  Database clusters provide the backbone
     –  Grids without the headache
 •  Build a query in seconds; get the result in seconds
 •  Need statistics in different ways:
     –  Live
     –  Online (and sometimes user visible)
     –  Whole of set and partial set, but based on Big Data
 •  Slice and dice in more ways without effort
  
Couchbase Background Stats

 •  Couchbase 1.8 already hits interesting numbers
 •  Draw Something (OMGPOP), within 6 weeks:
     –  15 million daily active users
     –  3000 drawings generated every two seconds
     –  Over two billion stored drawings
     –  90 nodes
     –  3 clusters
     –  No stops!
  
The new drug

 •  Couchbase Server 2.0
 •  Cluster-based database
 •  Fast, Scalable, Predictable
 •  Map/Reduce based querying
 •  JavaScript/Web-based interface
     –  Type in your query, get your results
 •  Instant Gratification!
  
The Data End

 •  Store data however you want
 •  The Map will sort it out for us
  
Map function creates matrices
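
A minimal sketch of what such a map function might look like, assuming documents with hypothetical `name` and `salary` fields. In a Couchbase Server 2.0 view the map function receives `(doc, meta)` and calls `emit(key, value)`; the `emit`, `runMap`, and `sampleDocs` definitions here are stand-ins so the sketch runs outside the server.

```javascript
// Stand-in for the view engine: collects emitted rows and sorts them by key.
// `rows`, `runMap`, and `sampleDocs` are hypothetical, for illustration only.
const rows = [];
function emit(key, value) {
  rows.push({ key: key, value: value });
}

// The map function itself: one row per document that has the fields we need.
function map(doc, meta) {
  if (doc.name && typeof doc.salary === "number") {
    emit([doc.name], doc.salary);
  }
}

function runMap(docs) {
  rows.length = 0;
  docs.forEach(function (doc, i) {
    map(doc, { id: "doc" + i });
  });
  // Index rows are kept in key order; the JSON form is used as a simple proxy.
  rows.sort(function (a, b) {
    const ka = JSON.stringify(a.key);
    const kb = JSON.stringify(b.key);
    return ka < kb ? -1 : ka > kb ? 1 : 0;
  });
  return rows;
}

const sampleDocs = [
  { name: "James", salary: 5000 },
  { type: "note", text: "no salary here" }, // skipped: input has no fixed schema
  { name: "Adam", salary: 7000 },
];
console.log(runMap(sampleDocs));
```

Documents that lack the expected fields are simply skipped, which is how the map can "sort out" data that was stored without a schema.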
  
Map/Reduce Creates Indexes

 •  Not Hadoop
 •  Map/Reduce creates an index
 •  Map *AND* Reduce output are stored
 •  Index is used for queries
 •  Makes queries faster (obviously!)
 •  Index is ‘materialized’ at query time
     –  Updated, not recreated
 •  Incremental map/reduce
  
Reduce is where it gets interesting
  
Reduce

 •  Reduce summarizes data
 •  Built-in functions
     –  _sum
     –  _count
     –  _stats
        {
              "value" : {
                  "count" : 3,
                  "min" : 5000,
                  "sumsqr" : 594000000,
                  "max" : 20000,
                  "sum" : 38000
              },
              "key" : [
                  "James"
              ]
        },
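
The fields in the `_stats` output are enough to derive a mean and a variance without going back to the raw values. The sketch below is an illustration, not the server's implementation: `stats` and `summarize` are hypothetical helpers, and the three input values are an assumption, chosen as one set that is consistent with the output shown.

```javascript
// Recomputes the fields of the _stats output for one key.
// The input values used below are an assumption, not taken from the slides.
function stats(values) {
  const out = { count: 0, min: Infinity, max: -Infinity, sum: 0, sumsqr: 0 };
  for (const v of values) {
    out.count += 1;
    out.min = Math.min(out.min, v);
    out.max = Math.max(out.max, v);
    out.sum += v;
    out.sumsqr += v * v;
  }
  return out;
}

// Mean and population variance follow from count, sum and sumsqr alone:
// variance = sumsqr / count - mean^2.
function summarize(s) {
  const mean = s.sum / s.count;
  const variance = s.sumsqr / s.count - mean * mean;
  return { mean: mean, variance: variance, stddev: Math.sqrt(variance) };
}

const james = stats([5000, 13000, 20000]);
console.log(james);
console.log(summarize(james));
```

Because each field is a sum, a minimum, or a maximum, partial results for the same key can be combined later, which is what makes this shape suitable for incremental reduce.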
  
Incremental reduce is where it gets interesting
  
Incremental Reduce

 •  Required at two levels
     –  During cluster-based queries
     –  During index updates
 •  Incremental reduce requires preparation
 •  Reduce functions must be able to consume their own output
 •  Roll-your-own only
     –  No external libraries
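
To see why a reduce function has to consume its own output, consider a hand-written count. This is a sketch under assumptions: `countReduce` and the chunked calls are hypothetical and only imitate how partial results might be combined across nodes or index updates.

```javascript
// A count reduce written by hand. On the first pass `values` holds raw map
// output, so the count is the array length. On a rereduce pass `values` holds
// earlier counts, so they must be added up instead.
function countReduce(key, values, rereduce) {
  if (rereduce) {
    let total = 0;
    for (const partial of values) {
      total += partial;
    }
    return total;
  }
  return values.length;
}

// Hypothetical harness: reduce two chunks separately, then combine them.
const chunkA = countReduce(null, ["a", "b", "c"], false);
const chunkB = countReduce(null, ["d", "e"], false);
const combined = countReduce(null, [chunkA, chunkB], true);
console.log(combined);

// Ignoring the rereduce flag would count the partial results, not the rows.
const wrong = countReduce(null, [chunkA, chunkB], false);
console.log(wrong);
```

The second call shows the failure mode: treating partial results as raw values counts two chunks instead of five rows.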
  
Tips for incremental

 •  Use simple values when possible
 •  Use complex (JSON) structures
     –  Allows for more incremental structure
     –  Store the ‘current’ result
     –  Store the information needed for the incremental result
 •  Identify rereduce:
     –  function(key, value, rereduce) {}
  
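Simple reduce (incremental average)

 function(key, values, rereduce) {
    // Running total and count; the average is total / count at query time.
    var result = {total: 0, count: 0};
    if (rereduce) {
       // values are earlier results of this function: merge them.
       for (var i = 0; i < values.length; i++) {
          result.total = result.total + values[i].total;
          result.count = result.count + values[i].count;
       }
    } else {
       // values are raw map output: summarize them once.
       result.total = sum(values);
       result.count = values.length;
    }
    return result;
 }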
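
The incremental average can be exercised outside the server with a small harness. This is a sketch under assumptions: `averageReduce` is a named copy of the reduce so it can be called directly, `sum` is defined locally because the view engine normally supplies it, and the input numbers are made up.

```javascript
// `sum` is supplied by the view engine; defined here so the sketch runs standalone.
function sum(values) {
  return values.reduce(function (a, b) { return a + b; }, 0);
}

// Named copy of the incremental average reduce.
function averageReduce(key, values, rereduce) {
  var result = { total: 0, count: 0 };
  if (rereduce) {
    for (var i = 0; i < values.length; i++) {
      result.total += values[i].total;
      result.count += values[i].count;
    }
  } else {
    result.total = sum(values);
    result.count = values.length;
  }
  return result;
}

// Hypothetical data split into two chunks, as a cluster or index update might.
var partA = averageReduce(null, [10, 20, 30], false);
var partB = averageReduce(null, [40], false);
var merged = averageReduce(null, [partA, partB], true);
var average = merged.total / merged.count;
console.log(merged, average);
```

Storing the total and the count, rather than the average itself, is the design choice that matters: averaging the two chunk averages (20 and 40) would give 30, while the true average of the four values is 25.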
  
Combining Reduce with Complex Keys

 •  Example: logging data with datetime
 •  Explode the date:
     –  [year, month, day, hour, minute]
 •  Now you can query:
     –  Single Date: [2012, 9, 19]
     –  Multiple Dates: [ [2012, 9, 19], [2012, 9, 10] ]
     –  Range (hours) [2012, 9, 0, 9, 0] – [2012, 9, 30, 21, 0]
     –  Range (days) [2012, 1, 1] – [2012, 9, 19]
     –  Range (months) [2009, 9] – [2012, 3]
 •  And you can calculate aggregate statistics
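
A sketch of the exploded-date idea in plain JavaScript. The helper names (`explodeDate`, `compareKeys`, `inRange`) and the timestamps are hypothetical, and the comparison is a simplified stand-in for the view engine's key ordering, handling only arrays of numbers.

```javascript
// Explode a timestamp into [year, month, day, hour, minute] (UTC, month 1-12).
function explodeDate(iso) {
  const d = new Date(iso);
  return [
    d.getUTCFullYear(),
    d.getUTCMonth() + 1,
    d.getUTCDate(),
    d.getUTCHours(),
    d.getUTCMinutes(),
  ];
}

// Compare two array keys element by element; a shorter key that matches as a
// prefix sorts first. Simplified: numbers only.
function compareKeys(a, b) {
  const n = Math.min(a.length, b.length);
  for (let i = 0; i < n; i++) {
    if (a[i] !== b[i]) return a[i] < b[i] ? -1 : 1;
  }
  return a.length - b.length;
}

// Keep the keys that fall inside [startkey, endkey], both ends inclusive.
function inRange(keys, startkey, endkey) {
  return keys.filter(function (k) {
    return compareKeys(k, startkey) >= 0 && compareKeys(k, endkey) <= 0;
  });
}

// Hypothetical log timestamps.
const keys = [
  "2012-01-01T08:30:00Z",
  "2012-09-19T14:05:00Z",
  "2012-09-20T09:00:00Z",
].map(explodeDate);

// Range (days): from 1 Jan 2012 through the end of 19 Sep 2012.
const selected = inRange(keys, [2012, 1, 1], [2012, 9, 19, 23, 59]);
console.log(selected);
```

Under this ordering an end key of `[2012, 9, 19]` sorts before any key for that day that also carries an hour and minute, so the sketch extends the end key to cover the whole day.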
  
Complex reduce

 function(key, data, rereduce) {
    // One counter per log level.
    var response = {"warning" : 0, "error": 0, "fatal" : 0 };
    for (var i = 0; i < data.length; i++) {
       if (rereduce) {
          // data[i] is an earlier result of this function: merge it.
          response.warning = response.warning + data[i].warning;
          response.error = response.error + data[i].error;
          response.fatal = response.fatal + data[i].fatal;
       } else {
          // data[i] is a raw map value: the log level as a string.
          if (data[i] == "warning") {
             response.warning++;
          }
          if (data[i] == "error" ) {
             response.error++;
          }
          if (data[i] == "fatal" ) {
             response.fatal++;
          }
       }
    }
    return response;
 }
  
Complex reduce output

 {"rows":[
 {"key":[2010,7], "value":{"warning":4,"error":2,"fatal":0}},
 {"key":[2010,8], "value":{"warning":4,"error":3,"fatal":0}},
 {"key":[2010,9], "value":{"warning":4,"error":6,"fatal":0}},
 {"key":[2010,10],"value":{"warning":7,"error":6,"fatal":0}},
 {"key":[2010,11],"value":{"warning":5,"error":8,"fatal":0}},
 {"key":[2010,12],"value":{"warning":2,"error":2,"fatal":0}},
 {"key":[2011,1], "value":{"warning":5,"error":1,"fatal":0}},
 {"key":[2011,2], "value":{"warning":3,"error":5,"fatal":0}},
 {"key":[2011,3], "value":{"warning":4,"error":4,"fatal":0}},
 {"key":[2011,4], "value":{"warning":3,"error":6,"fatal":0}}
 ]
 }
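
Because each row value is itself a valid input to the rereduce branch, the monthly rows can be rolled up further, similar in spirit to querying the view with a lower `group_level`. The sketch below is an illustration only: `merge` and `groupByLevel` are hypothetical helpers, and the rows are copied from the output above.

```javascript
// Monthly rows copied from the output above.
const monthly = [
  { key: [2010, 7], value: { warning: 4, error: 2, fatal: 0 } },
  { key: [2010, 8], value: { warning: 4, error: 3, fatal: 0 } },
  { key: [2010, 9], value: { warning: 4, error: 6, fatal: 0 } },
  { key: [2010, 10], value: { warning: 7, error: 6, fatal: 0 } },
  { key: [2010, 11], value: { warning: 5, error: 8, fatal: 0 } },
  { key: [2010, 12], value: { warning: 2, error: 2, fatal: 0 } },
  { key: [2011, 1], value: { warning: 5, error: 1, fatal: 0 } },
  { key: [2011, 2], value: { warning: 3, error: 5, fatal: 0 } },
  { key: [2011, 3], value: { warning: 4, error: 4, fatal: 0 } },
  { key: [2011, 4], value: { warning: 3, error: 6, fatal: 0 } },
];

// Merge partial results, as the rereduce branch of the reduce does.
function merge(partials) {
  const total = { warning: 0, error: 0, fatal: 0 };
  for (const p of partials) {
    total.warning += p.warning;
    total.error += p.error;
    total.fatal += p.fatal;
  }
  return total;
}

// Group rows by the first `level` elements of the key, then merge each group.
function groupByLevel(rows, level) {
  const groups = new Map();
  for (const row of rows) {
    const prefix = JSON.stringify(row.key.slice(0, level));
    if (!groups.has(prefix)) groups.set(prefix, []);
    groups.get(prefix).push(row.value);
  }
  return Array.from(groups, function (entry) {
    return { key: JSON.parse(entry[0]), value: merge(entry[1]) };
  });
}

const yearly = groupByLevel(monthly, 1);
console.log(yearly);
```

Grouping on the first key element gives one row per year; grouping on two elements returns the monthly rows unchanged.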
  
Why is the excitement back?

 •  Data in is easy; no schema, no formatting, no updates
 •  Data out is about the stats
     –  Not how we are going to produce them
 •  Queries are live
 •  Tweaks and updates and extensions are live
 •  Multiple views, multiple queries
 •  Reduce is optional (raw data)
 •  Massive datasets are not a problem
  
Q&A
  

Practical Magic with IncanterPractical Magic with Incanter
Practical Magic with Incanter
 
Investigative Analytics- What's in a Data Scientists Toolbox
Investigative Analytics- What's in a Data Scientists ToolboxInvestigative Analytics- What's in a Data Scientists Toolbox
Investigative Analytics- What's in a Data Scientists Toolbox
 

Bringing back the excitement to data analysis

  • 1. Bringing the excitement back to data analysis
      MC Brown, VP, TechPubs and Education
  • 2. In the year 1992…
      •  Freetext Database = Document/NoSQL Database
      •  Massive Datasets
          –  19043 records!!!
          –  Approx. 8k per record
  • 3. The Drug
      •  Data Analysis was ‘Exciting’
      •  2-3 days to write the analysis program
      •  Processing would occur overnight
      •  Statistics required ‘whole set’ processing
  • 4. The Hit
      •  Mornings were ‘the hit’
      •  The joy of real data analysis is the output of a good report
      •  Get good stats
          –  I know how many teachers teach Geography in Scotland!
          –  I know 400 people have purchased our History software!
      •  The wait and the results kept us working
  • 5. In the year 2002
      •  Grid computing was the drug
      •  Building 200-2000 node grid systems
      •  Analysis could happen the same day
      •  Datasets could be huge
          –  They just took more hours
      •  Still working on entire datasets
          –  Statistics still required whole set process
      •  Jobs became monotonous
      •  More about construction and technology than stats
  • 6. In the year 2012
      •  Need info and statistics quicker than ever
      •  Database clusters provide the backbone
          –  Grids without the headache
      •  Build a query in seconds; get the result in seconds
      •  Need statistics in different ways:
          –  Live
          –  Online (and sometimes user visible)
          –  Whole of set and partial set, but based on Big Data
      •  Slice and dice in more ways without effort
  • 7. Couchbase Background Stats
      •  Couchbase 1.8 already hits interesting numbers
      •  Draw Something (OMGPOP), within 6 weeks:
          –  15 million daily active users
          –  3000 drawings generated every two seconds
          –  Over two billion stored drawings
          –  90 nodes
          –  3 clusters
          –  No stops!
  • 8. The new drug
      •  Couchbase Server 2.0
      •  Cluster-based database
      •  Fast, Scalable, Predictable
      •  Map/Reduce based querying
      •  JavaScript/Web-based interface
          –  Type in your query, get your results
      •  Instant Gratification!
  • 9. The Data End
      •  Store data however you want
      •  The Map will sort it out for us
  • 10. Map function creates matrices
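
The deck names the map step without showing one. A minimal sketch of what "store data however you want, the Map will sort it out" could look like, written in the `function (doc, meta)` style of Couchbase Server 2.0 views. The document fields (`type`, `name`, `amount`), the sample documents, and the local `emit` stub are all assumptions for illustration; in a real view the server supplies `emit` and stores the rows.

```javascript
// Collected index rows; in a real view the server maintains these.
const rows = [];

// Local stand-in for the emit() that the view engine provides.
function emit(key, value) {
  rows.push({ key: key, value: value });
}

// Map function: pick out only the documents and fields we care about.
// The field names here are hypothetical.
function map(doc, meta) {
  if (doc.type === "sale" && typeof doc.amount === "number") {
    emit([doc.name], doc.amount);
  }
}

// Hypothetical documents with differing shapes.
const docs = [
  { id: "s1", body: { type: "sale", name: "James", amount: 5000 } },
  { id: "s2", body: { type: "sale", name: "James", amount: 13000 } },
  { id: "n1", body: { type: "note", text: "no amount here" } },
  { id: "s3", body: { type: "sale", name: "James", amount: 20000 } }
];

docs.forEach(function (d) { map(d.body, { id: d.id }); });
console.log(JSON.stringify(rows));
```

Documents that do not match the test inside the map function simply produce no row, which is why the input does not need a fixed schema.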
  • 11. Map/Reduce Creates Indexes
      •  Not Hadoop
      •  Map/Reduce creates an index
      •  Map *AND* Reduce output are stored
      •  Index is used for queries
      •  Makes queries faster (obviously!)
      •  Index is ‘materialized’ at query time
          –  Updated, not recreated
      •  Incremental map/reduce
  • 12. Reduce is where it gets interesting
  • 13. Reduce
      •  Reduce summarizes data
      •  Built-in functions
          –  _sum
          –  _count
          –  _stats
      Example _stats row:
          {
            "value" : {
              "count" : 3,
              "min" : 5000,
              "sumsqr" : 594000000,
              "max" : 20000,
              "sum" : 38000
            },
            "key" : [
              "James"
            ]
          }
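
To make the `_stats` fields concrete, here is a small sketch that computes the same five fields by hand. It is not the server's implementation, only an illustration. The input values are inferred from the slide's row: with a count of 3, a minimum of 5000, a maximum of 20000 and a sum of 38000, the remaining value has to be 13000. The function name `stats` is hypothetical.

```javascript
// Compute the five fields that appear in the _stats row on the slide.
function stats(values) {
  const out = { sum: 0, count: 0, min: Infinity, max: -Infinity, sumsqr: 0 };
  for (const v of values) {
    out.sum += v;
    out.count += 1;
    out.min = Math.min(out.min, v);
    out.max = Math.max(out.max, v);
    out.sumsqr += v * v;
  }
  return out;
}

// Values inferred from the slide's row (see the paragraph above).
const s = stats([5000, 13000, 20000]);

// Mean and population variance follow from sum, count and sumsqr alone,
// which is why storing sumsqr in the index is useful.
const mean = s.sum / s.count;
const variance = s.sumsqr / s.count - mean * mean;
console.log(s, mean, Math.sqrt(variance));
```

Because every field is a running total, a minimum, or a maximum, two such results can be merged without revisiting the underlying values.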
  • 14. Incremental reduce is where it gets interesting
  • 15. Incremental Reduce
      •  Required at two levels
          –  During cluster-based queries
          –  During index updates
      •  Incremental reduce requires preparation
      •  Reduce functions must be able to consume their own output
      •  Roll-your-own only
          –  No external libraries
  • 16. Tips for incremental
      •  Use simple values when possible
      •  Use complex (JSON) structures
          –  Allows for more incremental structure
          –  Store the ‘current’ result
          –  Store the information needed for the incremental result
      •  Identify rereduce:
          –  function(key, value, rereduce) {}
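  • 17. Simple reduce (incremental average)
      function(key, values, rereduce) {
        // Keep total and count so partial results can be merged later;
        // the average is total / count.
        var result = {total: 0, count: 0};
        if (rereduce) {
          // values holds the output of earlier reduce calls
          for (var i = 0; i < values.length; i++) {
            result.total = result.total + values[i].total;
            result.count = result.count + values[i].count;
          }
        } else {
          // values holds the raw numbers emitted by the map function
          result.total = sum(values);
          result.count = values.length;
        }
        return result;
      }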
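
The requirement that a reduce function consume its own output can be checked outside the database. The sketch below, a local simulation rather than anything Couchbase runs, reduces a set of values once in a single pass and once as two chunks merged by a rereduce call. The `sum` stub stands in for the helper available inside view functions; the name `reduceAvg` and the sample values are assumptions.

```javascript
// Local stand-in for the sum() helper available inside view reduce functions.
function sum(values) {
  return values.reduce(function (a, b) { return a + b; }, 0);
}

// Total/count reduce in the style of the slide.
function reduceAvg(key, values, rereduce) {
  const result = { total: 0, count: 0 };
  if (rereduce) {
    // values are earlier outputs of this same function
    for (const v of values) {
      result.total += v.total;
      result.count += v.count;
    }
  } else {
    // values are raw numbers from the map step
    result.total = sum(values);
    result.count = values.length;
  }
  return result;
}

// Hypothetical emitted values, reduced in one pass and in two chunks.
const values = [5000, 13000, 20000, 2000];
const onePass = reduceAvg(null, values, false);
const partA = reduceAvg(null, values.slice(0, 3), false);
const partB = reduceAvg(null, values.slice(3), false);
const merged = reduceAvg(null, [partA, partB], true);

// Averaging the two chunk averages directly gives a different answer,
// which is why the reduce stores total and count instead of the average.
const naive = (partA.total / partA.count + partB.total / partB.count) / 2;

console.log(onePass.total / onePass.count, merged.total / merged.count, naive);
```

The single-pass and merged results agree because totals and counts add up across chunks, while an average on its own does not carry enough information to be merged.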
  • 18. Combining Reduce with Complex Keys
      •  Example: logging data with datetime
      •  Explode the date:
          –  [year, month, day, hour, minute]
      •  Now you can query:
          –  Single Date: [2012, 9, 19]
          –  Multiple Dates: [[2012, 9, 19], [2012, 9, 10]]
          –  Range (hours): [2012, 9, 0, 9, 0] to [2012, 9, 30, 21, 0]
          –  Range (days): [2012, 1, 1] to [2012, 9, 19]
          –  Range (months): [2009, 9] to [2012, 3]
      •  And you can calculate aggregate statistics
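
A sketch of the "explode the date" idea, run locally rather than inside a view. The helper names (`explodeDate`, `compareKeys`, `queryRange`) and the log entries are hypothetical. The comparator orders array keys element by element and places a shorter key before any longer key it prefixes; that mirrors how view keys are usually described as sorting, but treat it as an assumption here.

```javascript
// Explode an ISO timestamp into [year, month, day, hour, minute] (UTC).
function explodeDate(iso) {
  const d = new Date(iso);
  return [d.getUTCFullYear(), d.getUTCMonth() + 1, d.getUTCDate(),
          d.getUTCHours(), d.getUTCMinutes()];
}

// Compare array keys element by element; a key that is a prefix of a
// longer key sorts before it.
function compareKeys(a, b) {
  const n = Math.min(a.length, b.length);
  for (let i = 0; i < n; i++) {
    if (a[i] !== b[i]) return a[i] < b[i] ? -1 : 1;
  }
  return a.length - b.length;
}

// Keep the rows whose key lies between startKey and endKey, inclusive.
function queryRange(rows, startKey, endKey) {
  return rows.filter(function (r) {
    return compareKeys(r.key, startKey) >= 0 &&
           compareKeys(r.key, endKey) <= 0;
  });
}

// Hypothetical log entries.
const log = [
  { ts: "2012-09-19T09:15:00Z", level: "warning" },
  { ts: "2012-09-19T21:40:00Z", level: "error" },
  { ts: "2012-09-20T08:05:00Z", level: "fatal" },
  { ts: "2012-03-02T11:30:00Z", level: "error" }
];
const rows = log
  .map(function (e) { return { key: explodeDate(e.ts), value: e.level }; })
  .sort(function (x, y) { return compareKeys(x.key, y.key); });

// One day. The end key carries hour and minute because, under this
// comparator, a bare [2012, 9, 19] sorts before every full key on that day.
const oneDay = queryRange(rows, [2012, 9, 19], [2012, 9, 19, 23, 59]);

// March through September 2012.
const months = queryRange(rows, [2012, 3], [2012, 9, 31, 23, 59]);

console.log(oneDay.length, months.length);
```

Putting the coarsest unit first is what lets one index answer day, month and year questions by changing only the start and end keys.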
  • 19. Complex reduce
      function(key, data, rereduce) {
        var response = {"warning" : 0, "error": 0, "fatal" : 0 };
        for (var i = 0; i < data.length; i++) {
          if (rereduce) {
            // data[i] is a partial result from an earlier reduce call
            response.warning = response.warning + data[i].warning;
            response.error = response.error + data[i].error;
            response.fatal = response.fatal + data[i].fatal;
          } else {
            // data[i] is a log level emitted by the map function
            if (data[i] == "warning") {
              response.warning++;
            }
            if (data[i] == "error") {
              response.error++;
            }
            if (data[i] == "fatal") {
              response.fatal++;
            }
          }
        }
        return response;
      }
  • 20. Complex reduce output
      {"rows":[
        {"key":[2010,7], "value":{"warning":4,"error":2,"fatal":0}},
        {"key":[2010,8], "value":{"warning":4,"error":3,"fatal":0}},
        {"key":[2010,9], "value":{"warning":4,"error":6,"fatal":0}},
        {"key":[2010,10],"value":{"warning":7,"error":6,"fatal":0}},
        {"key":[2010,11],"value":{"warning":5,"error":8,"fatal":0}},
        {"key":[2010,12],"value":{"warning":2,"error":2,"fatal":0}},
        {"key":[2011,1], "value":{"warning":5,"error":1,"fatal":0}},
        {"key":[2011,2], "value":{"warning":3,"error":5,"fatal":0}},
        {"key":[2011,3], "value":{"warning":4,"error":4,"fatal":0}},
        {"key":[2011,4], "value":{"warning":3,"error":6,"fatal":0}}
      ]
      }
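
Rows keyed by [year, month], as in the output above, are what grouping a view at the second key level would be expected to return. The sketch below imitates that locally: it truncates each exploded key to its first two elements, reduces each group, and then rolls the monthly results up to a year with a rereduce call. `groupReduce`, `countLevels` and the sample rows are hypothetical, and the grouping is a simulation rather than the server's own logic.

```javascript
// Count log levels; accepts raw levels or its own earlier output.
function countLevels(key, data, rereduce) {
  const response = { warning: 0, error: 0, fatal: 0 };
  for (let i = 0; i < data.length; i++) {
    if (rereduce) {
      response.warning += data[i].warning;
      response.error += data[i].error;
      response.fatal += data[i].fatal;
    } else if (Object.prototype.hasOwnProperty.call(response, data[i])) {
      response[data[i]] += 1;
    }
  }
  return response;
}

// Imitate grouping: truncate each key to `level` elements, then reduce
// the values that share a truncated key. Rows are assumed to be sorted.
function groupReduce(rows, level, reduceFn) {
  const groups = new Map();
  for (const row of rows) {
    const groupKey = row.key.slice(0, level);
    const id = JSON.stringify(groupKey);
    if (!groups.has(id)) groups.set(id, { key: groupKey, values: [] });
    groups.get(id).values.push(row.value);
  }
  return Array.from(groups.values()).map(function (g) {
    return { key: g.key, value: reduceFn(g.key, g.values, false) };
  });
}

// Hypothetical map output: exploded date key, log level as value.
const rows = [
  { key: [2010, 7, 3, 10, 15], value: "warning" },
  { key: [2010, 7, 9, 2, 0], value: "error" },
  { key: [2010, 7, 21, 17, 45], value: "warning" },
  { key: [2010, 8, 1, 8, 30], value: "fatal" }
];

const monthly = { rows: groupReduce(rows, 2, countLevels) };

// Roll the monthly results up to the year by feeding them back in.
const yearly = countLevels(
  [2010],
  monthly.rows.map(function (r) { return r.value; }),
  true
);

console.log(JSON.stringify(monthly), JSON.stringify(yearly));
```

The yearly figure is built from the stored monthly results rather than from the raw rows, which is the property the incremental reduce slides describe.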
  • 21. Why is the excitement back?
      •  Data in is easy; no schema, no formatting, no updates
      •  Data out is about the stats
          –  Not how we are going to produce them
      •  Queries are live
      •  Tweaks and updates and extensions are live
      •  Multiple views, multiple queries
      •  Reduce is optional (raw data)
      •  Massive datasets are not a problem
  • 22. Q&A