A Modest Proposal
          for Taming and Clarifying the Promises of Big Data
          and the Software-Driven Future

                              Brendan McAdams
                              10gen, Inc.
                              brendan@10gen.com
                              @rit

"In short, software is eating the world."
                          - Marc Andreesen
                                Wall Street Journal, Aug. 2011
                                http://on.wsj.com/XLwnmo




Friday, November 16, 12
Software is Eating the World

     • Amazon.com (and .uk, .es, etc.) started as a bookstore
              • Today, they sell just about everything - bicycles, appliances, computers, TVs, etc.
              • In some cities in America, they even do home grocery delivery
              • No longer primarily a physical-goods company - increasingly built around software
              • Pioneering the eBook revolution with Kindle
              • EC2 runs a huge percentage of the public internet

Software is Eating the World

    • Netflix started as a company to deliver DVDs to the home...




Software is Eating the World


    • Netflix started as a company to deliver DVDs to the home...
             • But as they’ve grown, business has shifted to an
             online streaming service
             • They are now rolling out rapidly in many countries
             including Ireland, the UK, Canada and the Nordics
             • No need for physical inventory or postal distribution ...
             just servers and digital copies




Disney Found Itself Forced To Transform...




                          From This...


Disney Found Itself Forced To Transform...




                          ... To This

But What Does All This Software Do?



     • Software always eats data – be it text files, user form input, emails, etc.

     • All things that eat must eventually excrete...

Ingestion = Excretion

     [Image: yeast + sugars = ethanol]

     Yeast Ingests Sugars, and Excretes Ethanol

Ingestion = Excretion

     [Image]

     Cows, er... well, you get the point.
So What Does Software Eat?

     • Software always eats data – be it text files, user form input, emails, etc.

     • But what does software excrete?
              • More Data, of course...
              • This data gets bigger and bigger
              • The range of viable solutions for storing & processing it narrows as it grows
              • Data Fertilizes Software, in an endless cycle...

There’s a Big Market Here...

     • Lots of Solutions for Big Data
              • Data Warehouse Software
              • Operational Databases
                          • Old-style systems being upgraded to scale storage + processing
                          • NoSQL - Cassandra, MongoDB, etc.
              • Platforms
                          • Hadoop

Don’t Tilt At Windmills...




Don’t Tilt At Windmills...


     • It is easy to get distracted by all of these solutions
     • Keep it simple
              • Use tools you (and your team) can understand
              • Use tools and techniques that can scale
              • Try not to reinvent the wheel




... And Don’t Bite Off More Than You Can Chew




     • Break it into smaller pieces
              • You can’t fit a whole pig into your mouth...
              • ... slice it into small parts that you can consume.

Big Data at a Glance

     [Diagram: a large dataset, with “username” as the primary key]

     • Big Data can be gigabytes, terabytes, petabytes or exabytes

     • An ideal big data system scales up and down across these data sizes – while providing a uniform view

     • Major concerns
              • Can I read & write this data efficiently at different scales?
              • Can I run calculations on large portions of this data?

Big Data at a Glance

     [Diagram: the large dataset, keyed on “username”, divided into chunks]

     • Systems like the Google File System (which inspired Hadoop’s HDFS) and MongoDB’s sharding handle the scale problem by chunking

     • Break the data up into smaller chunks, spread across many data nodes (a small lookup sketch follows)
       • Each data node contains many chunks
       • If a chunk gets too large or a node becomes overloaded, data can be rebalanced

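A minimal, illustrative Python sketch of that idea - not MongoDB's or HDFS's actual implementation. Split points on the shard key (here, username) define chunks, each chunk is assigned to a data node, and a key lookup tells you which chunk (and therefore which node) owns a document. The split points, node names, and chunk-to-node mapping below are all invented for the example.

```python
import bisect

# Illustrative split points on the "username" shard key: chunk i covers
# [SPLIT_POINTS[i-1], SPLIT_POINTS[i]); the first and last chunks run to -inf / +inf.
SPLIT_POINTS = ["d", "h", "s", "w"]   # 5 chunks in total
CHUNK_TO_NODE = {0: "node-1", 1: "node-2", 2: "node-3", 3: "node-4", 4: "node-1"}

def chunk_for(username):
    """Return the index of the chunk whose key range contains this username."""
    return bisect.bisect_right(SPLIT_POINTS, username.lower())

def node_for(username):
    """Return the data node that currently holds that chunk."""
    return CHUNK_TO_NODE[chunk_for(username)]

for name in ["brendan", "tyler", "mike"]:
    print(name, "-> chunk", chunk_for(name), "on", node_for(name))
```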
Chunks Represent Ranges of Values

     • Initially, an empty collection has a single chunk, running the range of minimum (-∞) to maximum (+∞)

                INSERT {USERNAME: “Bill”}

     • As we add data, more chunks are created covering new ranges, e.g. -∞ → “B”, “B” → “C”, “C” → +∞

                INSERT {USERNAME: “Becky”}
                INSERT {USERNAME: “Brendan”}

     • Individual or partial letter ranges are one possible chunk value, e.g. -∞ → “Ba”, “Ba” → “Be”, “Be” → “Br”, “Br” → +∞... but chunks can get smaller!

                INSERT {USERNAME: “Brad”}

     • The smallest possible chunk value is not a range, but a single possible value, e.g. “Brad” or “Brendan” (a toy splitting sketch follows)

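A toy Python sketch of the splitting behaviour described above, using a made-up limit of two documents per chunk (real systems such as MongoDB split on chunk size, roughly 64 MB, not on document count). It follows the same inserts as the diagram; the split points it picks are illustrative, not MongoDB's.

```python
# Each chunk is (low, high, docs): it owns keys in [low, high). None stands in for -inf / +inf.
MAX_DOCS_PER_CHUNK = 2   # toy threshold; real systems split on bytes, not documents

def insert(chunks, username):
    for i, (low, high, docs) in enumerate(chunks):
        if (low is None or username >= low) and (high is None or username < high):
            docs.append(username)
            if len(docs) > MAX_DOCS_PER_CHUNK:          # chunk too big: split it
                docs.sort()
                mid = docs[len(docs) // 2]              # a middle key becomes the new split point
                left = [d for d in docs if d < mid]
                right = [d for d in docs if d >= mid]
                chunks[i:i + 1] = [(low, mid, left), (mid, high, right)]
            return

chunks = [(None, None, [])]                             # one chunk: (-inf, +inf)
for name in ["Bill", "Becky", "Brendan", "Brad"]:
    insert(chunks, name)
for low, high, docs in chunks:
    print((low or "-inf"), "->", (high or "+inf"), docs)
```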
Big Data at a Glance

     [Diagram: the dataset split into chunks labelled a–h and s–z, keyed on “username”]

     • To simplify things, let’s look at our dataset split into chunks by letter

     • Each chunk is represented by a single letter marking its contents
        • You could think of “B” as really being “Ba” → “Bz”

Big Data at a Glance

     [Diagram: the same dataset shown as chunks a–h and s–z]

Big Data at a Glance

     [Diagram: the chunks of the dataset, shown shuffled out of order]

     MongoDB Sharding (as well as HDFS) breaks data into chunks (~64 MB)

Big Data at a Glance

     [Diagram: Data Nodes 1–4, each holding 25% of the chunks of the large dataset]

     Representing data as chunks allows many levels of scale across n data nodes

Scaling

     [Diagram: the same chunks spread across Data Nodes 1–5]

     The set of chunks can be evenly distributed across n data nodes

Add Nodes: Chunk Rebalancing

     [Diagram: the chunks redistributed across Data Nodes 1–5]

     The goal is equilibrium - an equal distribution.
     As nodes are added (or even removed), chunks can be redistributed for balance (see the sketch below).

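A small Python sketch of the rebalancing idea, under the simplifying assumption that chunks are interchangeable units and all that matters is an even count per node; real balancers also weigh chunk size, migration cost, and so on. Node and chunk names are illustrative.

```python
def rebalance(nodes):
    """Move chunks one at a time from the fullest node to the emptiest
    until no node holds more than one chunk more than any other."""
    while True:
        fullest = max(nodes, key=lambda n: len(nodes[n]))
        emptiest = min(nodes, key=lambda n: len(nodes[n]))
        if len(nodes[fullest]) - len(nodes[emptiest]) <= 1:
            return nodes
        nodes[emptiest].append(nodes[fullest].pop())

# Four nodes with four chunks each, then a fifth (empty) node is added.
nodes = {
    "node-1": ["x", "h", "b", "e"],
    "node-2": ["v", "u", "t", "c"],
    "node-3": ["d", "w", "f", "a"],
    "node-4": ["z", "y", "s", "g"],
    "node-5": [],                       # newly added node
}
for name, chunks in rebalance(nodes).items():
    print(name, chunks)
```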
Don’t Bite Off More Than You Can Chew...

     • The approach to calculating on big data is much the same as the approach to storing it

     • We need to break our data into bite-sized pieces
        • Build functions which can be composed together repeatedly on partitions of our data
        • Process portions of the data across multiple calculation nodes
        • Aggregate the results into a final set of results

Bite-Sized Pieces Are Easier to Swallow

     • These pieces are not chunks – rather, the individual data points that make up each chunk

     • Chunks also make useful data transfer units for processing
        • Transfer chunks as “Input Splits” to calculation nodes, allowing for scalable parallel processing

MapReduce the Pieces

     • The most common application of these techniques is MapReduce
       • Based on a Google whitepaper, it works with two primary functions – map and reduce – to calculate against large datasets

MapReduce to Calculate Big Data



     • MapReduce is designed to effectively process data at varying
     scales

     • Composable function units can be reused repeatedly for scaled
     results




MapReduce to Calculate Big Data

     • In addition to the HDFS storage component, Hadoop is built around MapReduce for calculation

     • MongoDB can be integrated with Hadoop to MapReduce data
        • No HDFS storage needed - data moves directly between MongoDB and Hadoop’s MapReduce engine

What is MapReduce?

     • MapReduce is made up of a series of phases, the primary of which are
              • Map
              • Shuffle
              • Reduce
     • Let’s look at a typical MapReduce job
              • Email records
              • Count the # of times a particular user has received email

MapReducing Email
            to: tyler
         from: brendan
      subject: Ruby Support


            to: brendan
             from: tyler
     subject: Re: Ruby Support


            to: mike
         from: brendan
      subject: Node Support


            to: brendan
            from: mike
     subject: Re: Node Support


             to: mike
            from: tyler
      subject: COBOL Support


              to: tyler
            from: mike
     subject: Re: COBOL Support
                 (WTF?)




Map Step

     The map function breaks each document into a key (a grouping) & a value, calling emit(k, v).

     For the six emails above, keying on the “to” field and emitting {count: 1} produces:

          key: tyler      value: {count: 1}
          key: brendan    value: {count: 1}
          key: mike       value: {count: 1}
          key: brendan    value: {count: 1}
          key: mike       value: {count: 1}
          key: tyler      value: {count: 1}

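A framework-free Python sketch of that map step: each email document is turned into a (key, value) pair keyed by its recipient, with a {count: 1} value.

```python
EMAILS = [
    {"to": "tyler",   "from": "brendan", "subject": "Ruby Support"},
    {"to": "brendan", "from": "tyler",   "subject": "Re: Ruby Support"},
    {"to": "mike",    "from": "brendan", "subject": "Node Support"},
    {"to": "brendan", "from": "mike",    "subject": "Re: Node Support"},
    {"to": "mike",    "from": "tyler",   "subject": "COBOL Support"},
    {"to": "tyler",   "from": "mike",    "subject": "Re: COBOL Support (WTF?)"},
]

def map_email(doc):
    """The map function: emit one (key, value) pair per document, keyed by recipient."""
    yield doc["to"], {"count": 1}

emitted = [pair for doc in EMAILS for pair in map_email(doc)]
for key, value in emitted:
    print("key:", key, "value:", value)
```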
Group/Shuffle Step

     Group like keys together, creating an array of their distinct values.
     (Automatically done by M/R frameworks)

     Before grouping, the emitted pairs are:

          key: tyler      value: {count: 1}
          key: brendan    value: {count: 1}
          key: tyler      value: {count: 1}
          key: mike       value: {count: 1}
          key: brendan    value: {count: 1}
          key: mike       value: {count: 1}

Group/Shuffle Step

     Group like keys together, creating an array of their distinct values.
     (Automatically done by M/R frameworks)

     After grouping (see the sketch below):

          key: tyler      values: [{count: 1}, {count: 1}]
          key: mike       values: [{count: 1}, {count: 1}]
          key: brendan    values: [{count: 1}, {count: 1}]

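The shuffle step from these two slides as a framework-free Python sketch; M/R frameworks do this for you, but it amounts to grouping the emitted pairs by key.

```python
from collections import defaultdict

# The (key, value) pairs emitted by the map step: one {"count": 1} per received email.
emitted = [
    ("tyler",   {"count": 1}), ("brendan", {"count": 1}),
    ("mike",    {"count": 1}), ("brendan", {"count": 1}),
    ("mike",    {"count": 1}), ("tyler",   {"count": 1}),
]

def shuffle(pairs):
    """Group like keys together, building the list of values for each key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)

for key, values in shuffle(emitted).items():
    print("key:", key, "values:", values)
```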
Reduce Step

     For each key, the reduce function flattens the list of values to a single result
     (aggregate the values, return the result):

          key: tyler      values: [{count: 1}, {count: 1}]   →   value: {count: 2}
          key: mike       values: [{count: 1}, {count: 1}]   →   value: {count: 2}
          key: brendan    values: [{count: 1}, {count: 1}]   →   value: {count: 2}

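Putting the three phases together: a self-contained, framework-free Python sketch of the whole job - map emits (recipient, {count: 1}), shuffle groups like keys, and reduce flattens each key's values into a single count of emails received.

```python
from collections import defaultdict

EMAILS = [
    {"to": "tyler"}, {"to": "brendan"}, {"to": "mike"},
    {"to": "brendan"}, {"to": "mike"}, {"to": "tyler"},
]

def map_email(doc):
    yield doc["to"], {"count": 1}                       # map: key by recipient

def shuffle(pairs):
    grouped = defaultdict(list)                         # shuffle: group like keys
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_counts(key, values):
    return key, {"count": sum(v["count"] for v in values)}   # reduce: flatten to one result

emitted = [pair for doc in EMAILS for pair in map_email(doc)]
results = [reduce_counts(k, vs) for k, vs in shuffle(emitted).items()]
for key, value in results:
    print("key:", key, "value:", value)
```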
Processing Scalable Big Data

     • MapReduce provides an effective system for calculating and processing our large datasets (from gigabytes through exabytes and beyond)

     • MapReduce is supported in many places, including MongoDB & Hadoop (a MongoDB sketch follows this list)

     • We have effective answers for both of our concerns:
        • Can I read & write this data efficiently at different scales?
        • Can I run calculations on large portions of this data?

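As one concrete example of MapReduce inside MongoDB, here is a hedged pymongo sketch of the same recipient count. It assumes a local mongod with an "emails" collection shaped like the records above, and a pymongo release that still exposes Collection.map_reduce (it was removed in pymongo 4, where the aggregation pipeline is the replacement). MongoDB's map and reduce functions are JavaScript, passed in as Code objects.

```python
from pymongo import MongoClient
from bson.code import Code

client = MongoClient("mongodb://localhost:27017")   # assumed local server
emails = client["demo"]["emails"]                   # assumed database/collection names

# Map: emit (recipient, {count: 1}) for each email document.
mapper = Code("function () { emit(this.to, {count: 1}); }")

# Reduce: flatten each key's list of values into a single total.
reducer = Code("""
function (key, values) {
    var total = 0;
    values.forEach(function (v) { total += v.count; });
    return {count: total};
}
""")

result = emails.map_reduce(mapper, reducer, "email_counts")   # output collection name
for doc in result.find():
    print(doc["_id"], doc["value"]["count"])
```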
Batch Isn’t a Sustainable Answer
     • There are downsides here - fundamentally, MapReduce is a batch process

     • Batch systems like Hadoop give us a “Catch-22”
              • You can get answers to questions from petabytes of data
              • But you can’t guarantee you’ll get them quickly
     • In some ways, this is a step backwards for our industry
     • Business stakeholders tend to want answers now
              • We must evolve

Moving Away from Batch
     • The Big Data world is moving rapidly away from slow, batch-based processing solutions

     • Google has moved from batch toward more realtime processing over the last few years

     • Hadoop is replacing “MapReduce as assembly language” with more flexible resource management in YARN
              • Now MapReduce is just one feature implemented on top of YARN
              • Build anything we want on top of it
     • Newer systems like Spark & Storm provide platforms for realtime processing (a Spark sketch follows this list)

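As a taste of those newer platforms, a hedged PySpark sketch of the same recipient count; the local master, app name, and inline sample data are assumptions for illustration rather than a production setup.

```python
from pyspark import SparkContext

sc = SparkContext("local", "email-counts")   # assumed local master for the demo

emails = sc.parallelize([
    {"to": "tyler",   "from": "brendan", "subject": "Ruby Support"},
    {"to": "brendan", "from": "tyler",   "subject": "Re: Ruby Support"},
    {"to": "mike",    "from": "brendan", "subject": "Node Support"},
])

counts = (emails
          .map(lambda doc: (doc["to"], 1))    # map: emit (recipient, 1)
          .reduceByKey(lambda a, b: a + b))   # reduce: sum per recipient

print(counts.collect())
sc.stop()
```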
In Closing
     • The World IS Being Eaten By Software
              • All that software is leaving behind an awful lot of data
              • We must be careful not to “step in it”
              • More Data Means More Software Means More Data Means...

     • Practical Solutions for Processing & Storing Data will save us

     • We as Data Scientists & Technologists must always evolve our strategies, thinking and tools

[Download the Hadoop Connector]
                          http://github.com/mongodb/mongo-hadoop
                                            [Docs]
                                http://api.mongodb.org/hadoop/



                                 ¿QUESTIONS?

                                     *Contact Me*
                                 brendan@10gen.com
                                     (twitter: @rit)




