NEARING THE EVENT HORIZON.
HADOOP WAS PREDICTABLE, WHAT’S NEXT?




            May 23, 2012       Mike Miller
                              mike@cloudant.com
                                @mlmilleratmit
What I Am

    Cloudant Founder, Chief Scientist
    (we’re hiring at all positions)

    Affiliate Assistant Professor, Particle Physics (UW)

    Background: machine learning, analysis, big data,
    globally distributed systems




Mike Miller, GlueCon May 2012                           2
What I Am




                                A CDN for your Application Data

What I Am Not


                                didn’t see these coming
                                Superluminal neutrinos
                                Red Sox epic collapse in September
                                Red Wings losing in the first round
                                ...

                                But here I go anyway




My First Postulate of Big-Data

                                     Google Matters

           What matters for Google...
           ... matters for the internet...
           ...and therefore matters for the enterprise...
           ... will therefore be re-architected by Apache...
           ... and therefore matters to you.




Evidence




               Business Week, 12/24/2007




The Old Canon
         • Google File System (the important one)
           http://labs.google.com/papers/gfs.html

         • MapReduce (the big one)
           http://labs.google.com/papers/mapreduce.html

         • BigTable (clone me!)
           http://labs.google.com/papers/bigtable.html

         • Dynamo (ok, AWS. but masterless quorum)
           http://s3.amazonaws.com/AllThingsDistributed/sosp/amazon-dynamo-sosp2007.pdf



                                copy these. use these. print $$$

MapReduce: The Awesome
         • Approachable interface
           “What do I do with a single piece of data?”

         • Data Parallel
           Developers can basically forget about scatter-gather

         • Fault Tolerant
           Failure at scale is the norm!
           Protects both user and system operator

         • IO Optimized
           Built for sequential IO
           commodity disks spinning forward at O(20 MB/sec) each
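
A minimal sketch of the "single piece of data" interface, written in plain Python rather than against Hadoop. The names `map_fn`, `reduce_fn`, and `run_mapreduce` are hypothetical, and the in-memory grouping only stands in for the scatter-gather that a real framework distributes across machines.

```python
from collections import defaultdict

def map_fn(line):
    """User code: what to do with a single record."""
    for word in line.split():
        yield word.lower(), 1

def reduce_fn(key, values):
    """User code: how to combine all values seen for one key."""
    return sum(values)

def run_mapreduce(records, mapper, reducer):
    """Toy single-process driver (assumption: a real framework shards this)."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)   # stands in for the shuffle step
    return {key: reducer(key, values) for key, values in groups.items()}

counts = run_mapreduce(["big data", "Big graphs"], map_fn, reduce_fn)
print(counts)  # {'big': 2, 'data': 1, 'graphs': 1}
```

The developer writes only the two small functions; partitioning, retries, and sequential IO are the framework's job.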




So... is that it?

   http://gigaom.com/cloud/democratizing-big-data-is-hadoop-our-only-hope/
   http://gigaom.com/cloud/what-it-really-means-when-someone-says-hadoop/
   http://mackiemathew.com/2012/02/25/the-problems-in-hadoop-when-does-it-fail-to-deliver/

MapReduce: The not so Awesome
         • Hadoop doesn’t power big data applications
           Not a transactional datastore. Slosh back and forth via ETL

         • Processing latency
           Non-incremental, must re-slurp entire dataset every pass

         • Ad-Hoc queries
           Bare metal interface, data import

         • Graphs
           Only a handful of graph problems amenable to MR
           http://www.computer.org/portal/web/csdl/doi/10.1109/MCSE.2009.120




To the Event Horizon




Enter The New Canon
         • Percolator
           incremental processing
           http://research.google.com/pubs/pub36726.html

         • Dremel
           ad-hoc analysis queries
           http://research.google.com/pubs/pub36632.html

         • Pregel
           Big graphs
           http://dl.acm.org/citation.cfm?id=1807184


                                Scalable, Fault Tolerant, Approachable

Percolator




Percolator: incremental processing
         • Replaced MapReduce as the tool to build search index
           “However, reprocessing the entire web discards the work done in earlier runs and makes latency
           proportional to the size of the repository, rather than the size of the update.”

         • Bigtable alone can’t do it
           “BigTable scales...but doesn’t provide tools to help programmers maintain data invariants in the
           face of concurrent updates.”

         • Applicability
           Incrementally updating data
           Computational output can be broken down into small pieces
           Computation large in some dimension (data size, cpu, etc)

         • Does it matter?
           “...Converting the indexing system to an incremental system ... reduced the average document
           processing latency by a factor of 100...”
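
A small sketch of why latency tracks the size of the update rather than the size of the repository. This is a toy inverted index in plain Python, not Percolator code; `build_index` and `update_index` are hypothetical names.

```python
from collections import defaultdict

def build_index(docs):
    """Batch style: rescan every document on each pass."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.split():
            index[word].add(doc_id)
    return index

def update_index(index, docs, doc_id, new_text):
    """Incremental style: touch only the postings of the changed document."""
    for word in docs.get(doc_id, "").split():
        index[word].discard(doc_id)
        if not index[word]:
            del index[word]             # drop postings lists that became empty
    for word in new_text.split():
        index[word].add(doc_id)
    docs[doc_id] = new_text

docs = {"a": "hadoop mapreduce", "b": "hadoop hdfs"}
index = build_index(docs)
update_index(index, docs, "b", "percolator bigtable")

# The incremental result matches a full rebuild, at a fraction of the work.
assert index == build_index(docs)
```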


Percolator: incremental processing
  • BigTable plus...
    Multi-row ACID Transactions
    snapshot isolation, lazy locks
    up to 10s write latencies

    Timestamps
    Start Timestamp (read)
    Commit Timestamp (write)

    Notifications
    Do not maintain invariants

    Observer Framework
    your code to be run upon notification of an update
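
A toy sketch of how start and commit timestamps give snapshot isolation, and how an observer fires after an update. Everything here (`ToyStore`, `Txn`, the counter used as a timestamp source) is hypothetical and single-process; it is not Percolator's API or locking protocol, and for simplicity the observers run synchronously at commit.

```python
import itertools

class ToyStore:
    """Toy multi-version table (assumption: not Percolator's real design)."""
    def __init__(self):
        self._clock = itertools.count(1)   # stand-in for a timestamp source
        self._versions = {}                # key -> [(commit_ts, value), ...]
        self._observers = {}               # key -> [callback, ...]

    def observe(self, key, callback):
        self._observers.setdefault(key, []).append(callback)

    def begin(self):
        return Txn(self, next(self._clock))

class Txn:
    def __init__(self, store, start_ts):
        self.store, self.start_ts, self.writes = store, start_ts, {}

    def get(self, key):
        # Read the newest version committed before this transaction started.
        versions = self.store._versions.get(key, [])
        visible = [value for ts, value in versions if ts < self.start_ts]
        return visible[-1] if visible else None

    def set(self, key, value):
        self.writes[key] = value           # buffered until commit

    def commit(self):
        # Abort if a written key gained a newer version after start_ts.
        for key in self.writes:
            versions = self.store._versions.get(key, [])
            if versions and versions[-1][0] > self.start_ts:
                return False
        commit_ts = next(self.store._clock)
        for key, value in self.writes.items():
            self.store._versions.setdefault(key, []).append((commit_ts, value))
        for key, value in self.writes.items():
            for callback in self.store._observers.get(key, []):
                callback(key, value)       # observer runs on the update
        return True

store = ToyStore()
seen = []
store.observe("doc:1", lambda key, value: seen.append((key, value)))

t1, t2 = store.begin(), store.begin()
t1.set("doc:1", "v1")
t2.set("doc:1", "v2")
ok1 = t1.commit()   # succeeds
ok2 = t2.commit()   # aborts: t1 committed the same key after t2 started
```

Reads use the start timestamp and writes become visible at the commit timestamp, so a transaction never sees a half-applied update from a concurrent writer.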


Percolator: incremental processing




                                Near Linear Scaling to 15k Cores

Percolator: incremental processing




                                Latency lower than MapReduce by 100x

Dremel




Dremel: ad-hoc Query
         • Scalable, interactive ad-hoc query system for read-only nested data
           “...capable of running aggregation queries over trillion-row tables in seconds.”

         • ... on nested data structures in situ
           Web and scientific data is often non-relational
           nested data (protobufs) underlies most structured data at Google

         • Usage
           DEFINE TABLE t AS /path/to/data/*
           SELECT TOP(signal1,100), COUNT(*) FROM t

         • Applicability
           Analysis of crawled documents
           Tracking of install data for apps on Android Market
           Crash reports
           Spam analysis...

                                                      Dream BI Tool

Dremel: ad-hoc Query
 • Ingredients
   In situ data
   SQL-like interface
   Serving trees for query execution
   Column-striped data (3-10x)
   Analysis Catalogs
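
A sketch of the column-striping ingredient, using flat Python dicts. It ignores the nested and repeated fields that Dremel actually handles; the records and the `stripe` helper are made up for illustration.

```python
records = [
    {"url": "a.com", "lang": "en", "bytes": 120},
    {"url": "b.com", "lang": "de", "bytes": 300},
    {"url": "c.com", "lang": "en", "bytes": 80},
]

def stripe(rows):
    """Turn a list of records into one list per field (column striping)."""
    columns = {}
    for row in rows:
        for field, value in row.items():
            columns.setdefault(field, []).append(value)
    return columns

columns = stripe(records)

# An aggregate over one field: the record layout walks every row object,
# while the column layout reads only the "bytes" values.
total_from_rows = sum(row["bytes"] for row in records)
total_from_columns = sum(columns["bytes"])
assert total_from_rows == total_from_columns == 500
```

When a query touches a few fields out of many, the column layout reads far less data, which is where the speedup over record-oriented storage comes from.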




Dremel: ad-hoc Query




                                Columns ~10x faster than Records

Dremel: ad-hoc Query



                Benchmark Data   MapReduce (via Sawzall)




                                       Dremel (via SQL)

Dremel: ad-hoc Query



                                     Significant Optimization Possible


 Dremel ~100x Faster than Stock MR




Dremel: ad-hoc Query




                          Most Production Queries Executed in <10 seconds

Pregel




Pregel: Big Graphs
         • Massively parallel processing of big graphs
           billions of vertices, trillions of edges

         • Bulk synchronous parallel model
           sequence of vertex oriented iterations
           send/receive messages from other vertex computations
           read/modify state of vertex, outgoing edges, graph topology

         • Expressive, easy to program
           distribution details hidden behind abstract API

         • Iterative
           computation continues until each vertex votes to terminate

         • In production
           PageRank in 15 lines of code
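
A single-process sketch of the vertex-centric, bulk synchronous model. It uses maximum-value propagation rather than PageRank because it shows vote-to-halt and reactivation by message in a few lines. `run_supersteps` and `max_value` are hypothetical names, not Pregel's API.

```python
def run_supersteps(values, edges, compute, max_steps=50):
    """Toy BSP loop (assumption: one process stands in for the cluster)."""
    active = set(values)                  # every vertex starts active
    inbox = {v: [] for v in values}
    for step in range(max_steps):
        outbox = {v: [] for v in values}
        for v in values:
            if v not in active and not inbox[v]:
                continue                  # halted and no incoming messages
            new_value, messages, halt = compute(step, values[v], inbox[v], edges[v])
            values[v] = new_value
            for target, payload in messages:
                outbox[target].append(payload)
            if halt:
                active.discard(v)         # vertex votes to terminate
            else:
                active.add(v)
        inbox = outbox                    # barrier: delivery in the next superstep
        if not active and not any(inbox.values()):
            break                         # all halted, no messages in flight
    return values

def max_value(step, value, messages, out_edges):
    """Vertex program: adopt the largest value seen, forward it if it changed."""
    best = max([value] + messages)
    if step == 0 or best > value:
        return best, [(target, best) for target in out_edges], True
    return value, [], True

edges = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
result = run_supersteps({"a": 3, "b": 6, "c": 2}, edges, max_value)
print(result)  # {'a': 6, 'b': 6, 'c': 6}
```

The vertex program only sees its own state, its outgoing edges, and its incoming messages; everything about distribution lives in the loop around it.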


Pregel: Big Graphs
  • Master “Name” node
    connects processes for messaging

  • Message Passing
    no remote procedure calls, no remote reads

  • Graph hashed across nodes
    vertex, outgoing edges stored in RAM

  • Aggregators
    global mechanism for aggregation
    all but final reduce computed on node local data

  • Checkpointing
    configurable, enables automatic recovery
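
A sketch of hashing vertices across workers and of an aggregator whose partial results are computed on worker-local data. The `partition` and `aggregate_sum` helpers are hypothetical, and `zlib.crc32` is an arbitrary choice of stable hash.

```python
import zlib

def partition(vertex_id, num_workers):
    """Assign a vertex to a worker by hashing its id."""
    return zlib.crc32(vertex_id.encode()) % num_workers

def aggregate_sum(vertex_values, num_workers):
    # Each worker reduces its own vertices locally...
    partials = [0] * num_workers
    for vertex_id, value in vertex_values.items():
        partials[partition(vertex_id, num_workers)] += value
    # ...and only the per-worker partials take part in the final reduce.
    return partials, sum(partials)

partials, total = aggregate_sum({"a": 3, "b": 6, "c": 2, "d": 1}, num_workers=2)
assert total == 12
```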


Pregel: Big Graphs




                                Near Linear Scaling to 1B nodes

Learn More
         • Incremental Processing
           Incremental, in-database map/reduce in Cloudant’s BigCouch
           HBase 0.92 supports observers/coprocessors
           Stream processing via Storm, HStreaming, etc.

         • Ad Hoc Query
           Google BigQuery
           Column stores (Vertica, etc)
           OpenDremel (stalled?)
           ?

         • Big Graphs
           Giraph on Hadoop (Apache Incubator)
           Golden Orb (stalled?)


Lessons Learned


 • Hire Jeff Dean and Sanjay Ghemawat
 • GFS enables everything
 • There is massive opportunity on the horizon



