7: Shortcomings in the MapReduce Paradigm

                                         Zubair Nabi

                               zubair.nabi@itu.edu.pk


                                       April 19, 2013




Zubair Nabi           7: Shortcomings in the MapReduce Paradigm   April 19, 2013   1 / 31
Outline

  1    Hadoop everywhere!

  2    Skew

  3    Heterogeneous Environment

  4    Low-level Programming Interface

  5    Strictly Batch-processing

  6    Single-input/single-output and Two-phase

  7    Iterative and Recursive Applications

  8    Incremental Computation



  1    Hadoop everywhere!

Users

          Adobe: Several areas, from social services to unstructured data storage
          and processing
          eBay: 532-node cluster storing 5.3 PB of data
          Facebook: Used for reporting/analytics; one cluster with 1100 nodes
          (12 PB) and another with 300 nodes (3 PB)
          LinkedIn: 3 clusters with 4000 nodes in total
          Twitter: To store and process Tweets and log files
          Yahoo!: Multiple clusters with 40000 nodes in total; the largest cluster
          has 4500 nodes

          Source: http://wiki.apache.org/hadoop/PoweredBy

But all is not well

          Over the years, Hadoop has become a one-size-fits-all solution to
          data-intensive computing
          As early as 2008, David DeWitt and Michael Stonebraker asserted that
          MapReduce was a “major step backwards” for data-intensive
          computing
          They opined:
                MapReduce is a major step backwards in database access because it
                negates schema and is too low-level
                It has a sub-optimal implementation, as it uses brute force
                instead of indexing, does not handle skew, and uses data pull instead of
                push
                It is just rehashing old database concepts
                It is missing most DBMS functionality, such as updates and
                transactions
                It is incompatible with DBMS tools, such as human visualization and
                data replication from one DBMS to another

  2    Skew

Introduction

          Due to the uneven distribution of intermediate key/value pairs, some
          reduce workers end up doing more work than others
          Such reducers become “stragglers”
          The data in a large number of real-world applications follows
          long-tailed (Zipf-like) distributions

Wordcount and skew
          Text corpora have a Zipfian skew, i.e. a very small number of words
          account for most occurrences




          For instance, of 242,758 words in the dataset used to generate the
          figure, the 10, 100, and 1000 most frequent words account for 22%,
          43%, and 64% of the entire set
          Such skewed intermediate results lead to uneven distribution of
          workload across reduce workers
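
To make the effect concrete, the sketch below hash-partitions synthetic Zipf-like word counts across reducers and reports each reducer's load. The vocabulary size, the 1/rank frequency model, the reducer count, and the helper names are illustrative assumptions; this is not the dataset behind the figures quoted above.

```python
import zlib
from collections import Counter

def zipf_counts(vocab_size, scale=100_000):
    """Synthetic word frequencies: the word of rank r occurs about scale / r times."""
    return {f"word{r}": max(1, scale // r) for r in range(1, vocab_size + 1)}

def reducer_loads(counts, num_reducers):
    """Assign every key to a reducer by hashing it, as a hash partitioner would."""
    loads = Counter()
    for word, n in counts.items():
        reducer = zlib.crc32(word.encode("utf-8")) % num_reducers
        loads[reducer] += n  # every occurrence of a word lands on one reducer
    return [loads[i] for i in range(num_reducers)]

counts = zipf_counts(vocab_size=10_000)
loads = reducer_loads(counts, num_reducers=8)
mean_load = sum(loads) / len(loads)
for i, load in enumerate(loads):
    print(f"reducer {i}: {load} values ({load / mean_load:.2f}x the mean)")
```

Whichever reducer receives the most frequent word carries at least that word's full count, no matter how evenly the remaining keys are spread.
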
PageRank and skew

          Even Google’s implementation of its core PageRank algorithm is
          plagued by the skew problem
          Google uses PageRank to estimate a webpage’s importance when
          ranking search results
                Map: Emit the outlinks for each page
                Reduce: Calculate rank per page
          The skew in intermediate data exists due to the huge disparity in the
          number of incoming links across pages on the Internet
          The scale of the problem is evident when we consider that Google
          currently indexes more than 25 billion webpages with skewed links
          For instance, Facebook has 49,376,609 incoming links (at the time of
          writing) while the personal webpage of the presenter has only 4
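
The map and reduce steps can be sketched as one iteration of a simplified PageRank. This is an illustration under stated assumptions, not Google's implementation: the damping factor of 0.85, the toy graph, and the choice to emit each page's rank share along its outlinks are all assumed here.

```python
from collections import defaultdict

DAMPING = 0.85  # assumed damping factor, a value commonly used with PageRank

def map_page(page, rank, outlinks):
    """Emit (target, contribution) for every outlink of a page."""
    for target in outlinks:
        yield target, rank / len(outlinks)

def reduce_page(page, contributions, num_pages):
    """Combine all contributions that arrived for one page into its new rank."""
    return (1 - DAMPING) / num_pages + DAMPING * sum(contributions)

def pagerank_iteration(graph, ranks):
    grouped = defaultdict(list)  # stands in for the shuffle
    for page, outlinks in graph.items():
        for target, share in map_page(page, ranks[page], outlinks):
            grouped[target].append(share)
    new_ranks = {p: reduce_page(p, grouped[p], len(graph)) for p in graph}
    values_per_key = {p: len(grouped[p]) for p in graph}
    return new_ranks, values_per_key

# Hypothetical toy graph: every page links to "hub", which links back to "a".
graph = {"a": ["hub"], "b": ["hub"], "c": ["hub"], "d": ["hub"], "hub": ["a"]}
ranks = {page: 1 / len(graph) for page in graph}
new_ranks, fan_in = pagerank_iteration(graph, ranks)
print(fan_in)  # number of values each reduce call has to process
```

The reduce call for a heavily linked page receives one value per incoming link, so its work grows with the page's popularity while most other keys stay cheap.
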



Zipf distributions are everywhere

          Zipf-like distributions also show up in inverted indexing,
          publish/subscribe systems, fraud detection, and various clustering
          algorithms
          P2P systems have Zipf distributions too, both in terms of users and
          content
          So do web caching schemes, email, and social networks

  3    Heterogeneous Environment

Introduction

          In the MapReduce model, tasks that take exceptionally long are
          labelled “stragglers”
          The framework launches a speculative copy of each straggler on
          another machine, expecting it to finish quickly
          Without this, the overall job completion time is dictated by the slowest
          straggler
          On Google clusters, speculative execution can reduce job completion
          time by 44%
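
A toy model shows why one straggler dictates the job completion time and how a speculative copy helps. The task durations, the detection delay, and the function names are made up for illustration; real schedulers decide when to speculate in more involved ways.

```python
def job_completion_time(task_durations):
    """A job finishes only when its slowest task does."""
    return max(task_durations)

def with_speculation(task_durations, detect_after, backup_duration):
    """Model a backup copy launched `detect_after` seconds into any task still
    running at that point; the task finishes when either copy finishes."""
    finish_times = []
    for duration in task_durations:
        if duration > detect_after:
            duration = min(duration, detect_after + backup_duration)
        finish_times.append(duration)
    return max(finish_times)

durations = [60, 62, 58, 61, 240]  # hypothetical task times in seconds, one straggler
baseline = job_completion_time(durations)
speculated = with_speculation(durations, detect_after=90, backup_duration=60)
print(baseline, speculated)
```
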




Hadoop’s assumptions regarding speculation

     1    All nodes are equal, i.e. they can perform work at more or less the
          same rate
     2    Tasks make progress at a constant rate throughout their lifetime
     3    There is no cost to launching a speculative task on an otherwise idle
          slot/node
     4    The progress score of a task captures the fraction of its total work that
          it has done. Specifically, the shuffle, merge, and reduce logic phases
          each take roughly 1/3 of the total time
     5    As tasks finish in waves, a task with a low progress score is most likely
          a straggler
     6    Tasks within the same phase require roughly the same amount of work

Assumptions 1 and 2

     1    All nodes are equal, i.e. they can perform work at more or less the
          same rate
     2    Tasks make progress at a constant rate throughout their lifetime

          Both break down in heterogeneous environments, which consist of
          multiple generations of hardware

Assumption 3

     3    There is no cost to launching a speculative task on an otherwise idle
          slot/node

          Breaks down because speculative tasks compete with other tasks for
          shared resources

Assumption 4

     4    The progress score of a task captures the fraction of its total work that
          it has done. Specifically, the shuffle, merge, and reduce logic phases
          each take roughly 1/3 of the total time

          Breaks down because in reduce tasks the shuffle phase takes the
          longest of the three phases rather than an equal share
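
The gap between the reported score and the real state of a task can be sketched as follows. The 1/3 weighting follows the assumption stated above; the 80/10/10 time split and the function names are hypothetical.

```python
PHASES = ("shuffle", "merge", "reduce")

def progress_score(phase, fraction_done):
    """Progress score of a reduce task when every phase is worth 1/3."""
    completed_phases = PHASES.index(phase)
    return (completed_phases + fraction_done) / len(PHASES)

def true_progress(phase, fraction_done, time_share):
    """Fraction of the total running time already behind the task, given the
    share of time each phase really takes."""
    done = sum(time_share[p] for p in PHASES[:PHASES.index(phase)])
    return done + fraction_done * time_share[phase]

# Hypothetical split in which copying map output dominates the task.
time_share = {"shuffle": 0.8, "merge": 0.1, "reduce": 0.1}

score = progress_score("merge", 0.0)              # shuffle has just finished
actual = true_progress("merge", 0.0, time_share)
print(f"reported {score:.2f}, actual {actual:.2f}")
```

Under this split a task that has just finished shuffling reports a third of its work done although most of its running time is already behind it, so comparing scores across tasks in different phases can misidentify stragglers.
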




Assumption 5

     5    As tasks finish in waves, a task with a low progress score is most likely
          a straggler

          Breaks down because uneven workloads spread task completions across
          time instead of in waves

Assumption 6

     6    Tasks within the same phase require roughly the same amount of work

          Breaks down due to data skew

  4    Low-level Programming Interface

Introduction

          The one-input, two-stage data flow is extremely rigid for ad-hoc
          analysis of large datasets
          Hacks need to be put in place for other data flows, such as joins or
          multiple stages
          Custom code has to be written for common DB operations, such as
          projection and filtering
          The opaque nature of map and reduce functions prevents the system
          from performing optimizations, such as operator reordering
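
As an illustration of the custom-code point, here is a projection and a filter written by hand as a map function. The record layout, the sample data, and the tiny in-memory runner are hypothetical stand-ins, not a real framework API.

```python
def run_map_only_job(records, map_fn):
    """Minimal in-memory stand-in for a map-only job; not a real framework."""
    output = []
    for record in records:
        output.extend(map_fn(record))
    return output

def filter_and_project(record):
    """Hand-written equivalent of:
    SELECT name, visits FROM pages WHERE visits > 100"""
    url, name, visits = record.split("\t")
    if int(visits) > 100:              # filtering
        yield (name, int(visits))      # projection

# Hypothetical tab-separated input records.
pages = [
    "http://a.example\tA\t250",
    "http://b.example\tB\t40",
    "http://c.example\tC\t101",
]
print(run_map_only_job(pages, filter_and_project))
```

The runner sees only an opaque function, so it cannot tell which fields are read or that a filter is present. A declarative layer that knows the operators is in a position to reorder or combine them.
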




  5    Strictly Batch-processing

Introduction

          In MapReduce, the entire output of a map or a reduce task
          needs to be materialized to local storage before the next stage can
          commence
          This simplifies fault tolerance
          Reducers have to pull their input instead of the mappers pushing it
          This rules out pipelining, result estimation, and continuous queries
          (stream processing)
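
The difference between materializing a stage's output and pipelining it can be sketched with plain Python generators. This is an analogy only: the function names are made up, and building a list stands in for writing the output to local storage.

```python
def map_stage(lines, log):
    for line in lines:
        log.append(f"map {line}")
        yield line.upper()

def consume(stage_output, log):
    for item in stage_output:
        log.append(f"consume {item}")

def run_materialized(lines):
    """Batch style: the full map output exists before the consumer starts."""
    log = []
    materialized = list(map_stage(lines, log))  # stands in for writing to disk
    consume(materialized, log)
    return log

def run_pipelined(lines):
    """Pipelined style: each record flows to the consumer as it is produced."""
    log = []
    consume(map_stage(lines, log), log)
    return log

print(run_materialized(["a", "b"]))
print(run_pipelined(["a", "b"]))
```

In the materialized run the consumer sees nothing until every record has been mapped, which is what stands in the way of early estimates and of continuous queries over unbounded input.
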




  6    Single-input/single-output and Two-phase

Introduction

     1    Not all applications, such as complex SQL-like queries, can be
          broken down into just two phases
     2    Tasks take in just one input and produce just one output
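
One common workaround for the single-input restriction is to tag each record with its source and perform the join in the reducer. The sketch below runs in memory; the two datasets and all names are hypothetical.

```python
from collections import defaultdict

def tag_records(users, orders):
    """Merge two logical inputs into one stream by tagging each record with
    the dataset it came from, keyed on the join attribute."""
    for user_id, name in users:
        yield user_id, ("user", name)
    for user_id, item in orders:
        yield user_id, ("order", item)

def reduce_join(key, tagged_values):
    """Separate the tagged values again and emit their cross product."""
    names = [v for tag, v in tagged_values if tag == "user"]
    items = [v for tag, v in tagged_values if tag == "order"]
    return [(key, name, item) for name in names for item in items]

def join(users, orders):
    groups = defaultdict(list)  # stands in for the shuffle
    for key, value in tag_records(users, orders):
        groups[key].append(value)
    result = []
    for key in sorted(groups):
        result.extend(reduce_join(key, groups[key]))
    return result

users = [(1, "ann"), (2, "bob")]                  # hypothetical inputs
orders = [(1, "book"), (1, "pen"), (3, "lamp")]
print(join(users, orders))
```

The tagging and untagging logic is bookkeeping that the programmer has to write and maintain for every join, which is the kind of hack the two-phase, single-input model forces.
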




  7    Iterative and Recursive Applications

Introduction

     1    Hadoop is widely employed for iterative computations
     2    For machine learning applications, the Apache Mahout library is used
          atop Hadoop
     3    Mahout uses an external driver program to submit multiple jobs to
          Hadoop and perform a convergence test
     4    The driver loop has no fault tolerance, and each job submission adds
          overhead
     5    Loop-invariant data is materialized to storage in every iteration
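
The driver pattern can be sketched as an external loop that submits one job per iteration and tests for convergence in between. `submit_job` is a hypothetical stand-in that runs a simple averaging step in-process; it is not Mahout's or Hadoop's API.

```python
def submit_job(state, invariant_data):
    """Hypothetical stand-in for submitting one MapReduce job. A real job would
    re-read `invariant_data` from storage on every call."""
    # One averaging step that pulls every value toward the mean of the data.
    target = sum(invariant_data) / len(invariant_data)
    return [(value + target) / 2 for value in state]

def driver(initial_state, invariant_data, tolerance=1e-6, max_iterations=100):
    """External driver: one job per iteration, convergence tested in between."""
    state = initial_state
    for iteration in range(1, max_iterations + 1):
        new_state = submit_job(state, invariant_data)  # full job start-up each time
        change = max(abs(a - b) for a, b in zip(new_state, state))
        state = new_state
        if change < tolerance:                         # convergence test
            break
    return state, iteration

state, jobs_submitted = driver([0.0, 10.0], invariant_data=[2.0, 4.0, 6.0])
print(f"converged after {jobs_submitted} jobs: {state}")
```

Every pass through the loop pays the cost of a full job submission and rereads the same invariant data, and nothing in the framework knows that the jobs belong to one computation.
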




  8    Incremental Computation

Introduction




     1    Most workloads processed by MapReduce are incremental by nature,
          i.e. MapReduce jobs often run repeatedly with small changes in their
          input
     2    For instance, most re-runs of PageRank operate on input with only
          very small modifications
     3    Unfortunately, even with a small change in input, MapReduce
          re-performs the entire computation
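
One way to avoid redoing unchanged work is to memoize map output per input split. The sketch below is an illustration under stated assumptions, not the mechanism of any particular system: `MemoizingRunner` and `word_count_map` are hypothetical names, splits are plain strings, and word count stands in for the job. Each split is keyed by a hash of its contents, so a re-run only executes map tasks for splits that actually changed.

```python
import hashlib
from collections import Counter


def word_count_map(split):
    """Map task: count the words in one input split."""
    return Counter(split.split())


class MemoizingRunner:
    """Caches map output per split, keyed by a hash of the split contents."""

    def __init__(self):
        self.cache = {}
        self.map_tasks_run = 0  # how many map tasks were actually executed

    def run(self, splits):
        total = Counter()
        for split in splits:
            key = hashlib.sha256(split.encode("utf-8")).hexdigest()
            if key not in self.cache:  # unseen or modified split
                self.cache[key] = word_count_map(split)
                self.map_tasks_run += 1
            total.update(self.cache[key])  # reduce: merge per-split counts
        return total


if __name__ == "__main__":
    runner = MemoizingRunner()
    runner.run(["a b a", "c d", "e"])
    # Only the last split changes between the two runs.
    result = runner.run(["a b a", "c d", "e f"])
    print(runner.map_tasks_run, dict(result))
```

Plain MapReduce would execute all three map tasks again on the second run. Here only the modified split is recomputed, although the reduce step is still repeated in full.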




References

     1    MapReduce: A major step backwards:
          http://homes.cs.washington.edu/~billhowe/mapreduce_a_major_step_backwards.html
     2    Matei Zaharia, Andy Konwinski, Anthony D. Joseph, Randy Katz, and
          Ion Stoica. 2008. Improving MapReduce performance in
          heterogeneous environments. In Proceedings of the 8th USENIX
          conference on Operating systems design and implementation
          (OSDI’08). USENIX Association, Berkeley, CA, USA, 29-42.
     3    Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar,
          and Andrew Tomkins. 2008. Pig latin: a not-so-foreign language for
          data processing. In Proceedings of the 2008 ACM SIGMOD
          international conference on Management of data (SIGMOD ’08). ACM,
          New York, NY, USA, 1099-1110.




     4    Tyson Condie, Neil Conway, Peter Alvaro, Joseph M. Hellerstein,
          Khaled Elmeleegy, and Russell Sears. 2010. MapReduce online. In
          Proceedings of the 7th USENIX conference on Networked systems
          design and implementation (NSDI’10). USENIX Association, Berkeley,
          CA, USA.
     5    Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis
          Fetterly. 2007. Dryad: distributed data-parallel programs from
          sequential building blocks. In Proceedings of the 2nd ACM
          SIGOPS/EuroSys European Conference on Computer Systems 2007
          (EuroSys ’07). ACM, New York, NY, USA, 59-72.
     6    Derek G. Murray, Malte Schwarzkopf, Christopher Smowton, Steven
          Smith, Anil Madhavapeddy, and Steven Hand. 2011. CIEL: a universal
          execution engine for distributed data-flow computing. In Proceedings of
          the 8th USENIX conference on Networked systems design and
          implementation (NSDI’11). USENIX Association, Berkeley, CA, USA.





     7    Pramod Bhatotia, Alexander Wieder, Rodrigo Rodrigues, Umut A.
          Acar, and Rafael Pasquin. 2011. Incoop: MapReduce for incremental
          computations. In Proceedings of the 2nd ACM Symposium on Cloud
          Computing (SOCC ’11). ACM, New York, NY, USA.




Topic 7: Shortcomings in the MapReduce Paradigm

  • 1. 7: Shortcomings in the MapReduce Paradigm Zubair Nabi zubair.nabi@itu.edu.pk April 19, 2013 Zubair Nabi 7: Shortcomings in the MapReduce Paradigm April 19, 2013 1 / 31
  • 2. Outline 1 Hadoop everywhere! 2 Skew 3 Heterogeneous Environment 4 Low-level Programming Interface 5 Strictly Batch-processing 6 Single-input/single output and Two-phase 7 Iterative and Recursive Applications 8 Incremental Computation Zubair Nabi 7: Shortcomings in the MapReduce Paradigm April 19, 2013 2 / 31
  • 3. Outline 1 Hadoop everywhere! 2 Skew 3 Heterogeneous Environment 4 Low-level Programming Interface 5 Strictly Batch-processing 6 Single-input/single output and Two-phase 7 Iterative and Recursive Applications 8 Incremental Computation Zubair Nabi 7: Shortcomings in the MapReduce Paradigm April 19, 2013 3 / 31
  • 4. Users1 Adobe: Several areas from social services to unstructured data storage and processing 1 http://wiki.apache.org/hadoop/PoweredBy Zubair Nabi 7: Shortcomings in the MapReduce Paradigm April 19, 2013 4 / 31
  • 5. Users1 Adobe: Several areas from social services to unstructured data storage and processing eBay: 532 nodes cluster storing 5.3PB of data 1 http://wiki.apache.org/hadoop/PoweredBy Zubair Nabi 7: Shortcomings in the MapReduce Paradigm April 19, 2013 4 / 31
  • 6. Users1 Adobe: Several areas from social services to unstructured data storage and processing eBay: 532 nodes cluster storing 5.3PB of data Facebook: Used for reporting/analytics; one cluster with 1100 nodes (12PB) and another with 300 nodes (3PB) 1 http://wiki.apache.org/hadoop/PoweredBy Zubair Nabi 7: Shortcomings in the MapReduce Paradigm April 19, 2013 4 / 31
  • 7. Users1 Adobe: Several areas from social services to unstructured data storage and processing eBay: 532 nodes cluster storing 5.3PB of data Facebook: Used for reporting/analytics; one cluster with 1100 nodes (12PB) and another with 300 nodes (3PB) LinkedIn: 3 clusters with collectively 4000 nodes 1 http://wiki.apache.org/hadoop/PoweredBy Zubair Nabi 7: Shortcomings in the MapReduce Paradigm April 19, 2013 4 / 31
  • 8. Users1 Adobe: Several areas from social services to unstructured data storage and processing eBay: 532 nodes cluster storing 5.3PB of data Facebook: Used for reporting/analytics; one cluster with 1100 nodes (12PB) and another with 300 nodes (3PB) LinkedIn: 3 clusters with collectively 4000 nodes Twitter: To store and process Tweets and log files 1 http://wiki.apache.org/hadoop/PoweredBy Zubair Nabi 7: Shortcomings in the MapReduce Paradigm April 19, 2013 4 / 31
  • 9. Users1 Adobe: Several areas from social services to unstructured data storage and processing eBay: 532 nodes cluster storing 5.3PB of data Facebook: Used for reporting/analytics; one cluster with 1100 nodes (12PB) and another with 300 nodes (3PB) LinkedIn: 3 clusters with collectively 4000 nodes Twitter: To store and process Tweets and log files Yahoo!: Multiple clusters with collectively 40000 nodes; largest cluster has 4500 nodes! 1 http://wiki.apache.org/hadoop/PoweredBy Zubair Nabi 7: Shortcomings in the MapReduce Paradigm April 19, 2013 4 / 31
  • 16. But all is not well
      • Over the years, Hadoop has become a one-size-fits-all solution to data-intensive computing
      • As early as 2008, David DeWitt and Michael Stonebraker asserted that MapReduce was a “major step backwards” for data-intensive computing
      • They opined that MapReduce:
          • Is a major step backwards in database access because it negates schema and is too low-level
          • Has a sub-optimal implementation, as it uses brute force instead of indexing, does not handle skew, and uses data pull instead of push
          • Just rehashes old database concepts
          • Is missing most DBMS functionality, such as updates and transactions
          • Is incompatible with DBMS tools, such as human visualization and data replication from one DBMS to another
  • 17. Outline: section 2, Skew
  • 20. Introduction
      • Due to the uneven distribution of intermediate key/value pairs, some reduce workers end up doing more work
      • Such reducers become “stragglers”
      • A large number of real-world applications follow long-tailed (Zipf-like) distributions
  • 23. Wordcount and skew
      • Text corpora have a Zipfian skew, i.e. a very small number of words account for most occurrences
      • For instance, of the 242,758 words in the dataset used to generate the figure, the 10, 100, and 1000 most frequent words account for 22%, 43%, and 64% of the entire set
      • Such skewed intermediate results lead to an uneven distribution of workload across reduce workers
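The effect of Zipfian skew on reducer load can be illustrated with a small simulation. This is only a sketch under stated assumptions: the corpus is synthetic (the word of rank r occurs `top_count // r` times, an idealised Zipf law, not the dataset from the slides), the partitioner is a CRC32 hash modulo the number of reducers, and no combiner is modelled. The names `zipf_corpus`, `partition`, and `reducer_loads` are hypothetical.

```python
import zlib
from collections import Counter

def zipf_corpus(vocab_size, top_count):
    """Synthetic corpus: the word of rank r occurs top_count // r times."""
    return {f"w{r}": top_count // r for r in range(1, vocab_size + 1)}

def partition(word, num_reducers):
    """Deterministic hash partitioner, analogous to hash(key) mod R."""
    return zlib.crc32(word.encode("utf-8")) % num_reducers

def reducer_loads(corpus, num_reducers):
    """Number of intermediate (word, 1) pairs that each reducer receives."""
    loads = Counter({r: 0 for r in range(num_reducers)})
    for word, count in corpus.items():
        loads[partition(word, num_reducers)] += count
    return loads

NUM_REDUCERS = 20
corpus = zipf_corpus(vocab_size=1000, top_count=100_000)
loads = reducer_loads(corpus, NUM_REDUCERS)

total = sum(corpus.values())
mean_load = total / NUM_REDUCERS
max_load = max(loads.values())
top10_share = sum(sorted(corpus.values(), reverse=True)[:10]) / total

print(f"mean load {mean_load:.0f}, max load {max_load}")
print(f"top 10 words carry {top10_share:.0%} of all occurrences")
```

Because every occurrence of a key goes to the same reducer, whichever reducer is assigned the most frequent word receives far more than the average share, regardless of how evenly the remaining keys are spread.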
  • 30. PageRank and skew
      • Even Google’s implementation of its core PageRank algorithm is plagued by the skew problem
      • Google uses PageRank to calculate a webpage’s relevance for a given search query
          • Map: Emit the outlinks for each page
          • Reduce: Calculate rank per page
      • The skew in intermediate data exists due to the huge disparity in the number of incoming links across pages on the Internet
      • The scale of the problem is evident when we consider that Google currently indexes more than 25 billion webpages with skewed links
      • For instance, Facebook has 49,376,609 incoming links (at the time of writing), while the personal webpage of the presenter has only 4 (=))
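The map and reduce steps above can be sketched as a single in-memory PageRank iteration. This is a minimal illustration, not Google's implementation: it assumes the common textbook formulation in which map emits an equal share of a page's rank along each outlink, reduce sums the shares with a damping factor of 0.85, and every page has at least one outlink. The five-page graph and all function names are hypothetical.

```python
from collections import defaultdict

def pagerank_map(page, rank, outlinks):
    """Emit an equal share of this page's rank to every page it links to."""
    for target in outlinks:
        yield target, rank / len(outlinks)

def pagerank_reduce(contributions, num_pages, damping=0.85):
    """Combine the incoming shares into the page's new rank."""
    return (1 - damping) / num_pages + damping * sum(contributions)

def one_iteration(graph, ranks):
    """Run map, group by key (the shuffle), then reduce, all in memory."""
    grouped = defaultdict(list)
    for page, outlinks in graph.items():
        for target, share in pagerank_map(page, ranks[page], outlinks):
            grouped[target].append(share)
    new_ranks = {
        page: pagerank_reduce(grouped.get(page, []), len(graph))
        for page in graph
    }
    # Number of intermediate values each reduce key has to process.
    values_per_key = {page: len(grouped.get(page, [])) for page in graph}
    return new_ranks, values_per_key

# Hypothetical graph: almost every page links to "hub".
graph = {
    "hub": ["a"],
    "a": ["hub"],
    "b": ["hub"],
    "c": ["hub"],
    "d": ["hub", "a"],
}
ranks = {page: 1 / len(graph) for page in graph}
new_ranks, values_per_key = one_iteration(graph, ranks)
print(values_per_key)
```

The reduce key for `hub` receives one value per incoming link, so the reducer that handles a heavily linked page does proportionally more work than those handling pages with few or no inlinks.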
  • 33. Zipf distributions are everywhere
      • Followed by inverted indexing, publish/subscribe systems, fraud detection, and various clustering algorithms
      • P2P systems have Zipf distributions too, both in terms of users and content
      • So do web caching schemes, as well as email and social networks
  • 34. Outline: section 3, Heterogeneous Environment
  • 38. Introduction
      • In the MapReduce model, tasks that take exceptionally long are labelled “stragglers”
      • The framework launches a speculative copy of each straggler on another machine, expecting it to finish quickly
      • Without this, the overall job completion time is dictated by the slowest straggler
      • On Google clusters, speculative execution can reduce job completion time by 44%
  • 44. Hadoop’s assumptions regarding speculation
      1 All nodes are equal, i.e. they can perform work at more or less the same rate
      2 Tasks make progress at a constant rate throughout their lifetime
      3 There is no cost to launching a speculative task on an otherwise idle slot/node
      4 The progress score of a task captures the fraction of its total work that it has done. Specifically, the shuffle, merge, and reduce logic phases each take roughly 1/3 of the total time
      5 As tasks finish in waves, a task with a low progress score is most likely a straggler
      6 Tasks within the same phase require roughly the same amount of work
  • 47. Assumptions 1 and 2
      1 All nodes are equal, i.e. they can perform work at more or less the same rate
      2 Tasks make progress at a constant rate throughout their lifetime
      • Both break down in heterogeneous environments, which consist of multiple generations of hardware
  • 49. Assumption 3
      3 There is no cost to launching a speculative task on an otherwise idle slot/node
      • Breaks down due to shared resources
  • 51. Assumption 4
      4 The progress score of a task captures the fraction of its total work that it has done. Specifically, the shuffle, merge, and reduce logic phases each take roughly 1/3 of the total time
      • Breaks down because, in reduce tasks, the shuffle phase takes far longer than the other two
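The gap between the progress score and the real elapsed fraction can be made concrete with a little arithmetic. The sketch below assumes the score is computed as (completed phases + fraction of the current phase) / 3, which is one straightforward reading of "each phase takes roughly 1/3". The 80/10/10 time split is a made-up example, not a measurement, and the function names are hypothetical.

```python
PHASES = ("shuffle", "merge", "reduce")

def progress_score(phase, fraction_done):
    """Score under the assumption that each of the three phases is 1/3 of the work."""
    return (PHASES.index(phase) + fraction_done) / len(PHASES)

def elapsed_fraction(phase, fraction_done, time_share):
    """Fraction of the task's running time actually spent, given per-phase time shares."""
    index = PHASES.index(phase)
    finished = sum(time_share[p] for p in PHASES[:index])
    return finished + fraction_done * time_share[phase]

# Hypothetical time split in which the shuffle dominates the reduce task.
time_share = {"shuffle": 0.8, "merge": 0.1, "reduce": 0.1}

# A reduce task that has just finished its shuffle phase.
score = progress_score("shuffle", 1.0)
actual = elapsed_fraction("shuffle", 1.0, time_share)
print(f"progress score {score:.2f}, actual fraction of time elapsed {actual:.2f}")
```

Under these assumed numbers the task reports a score of about 0.33 although 80% of its running time is already behind it, so a scheduler that compares scores can mistake a nearly finished task for a slow one.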
  • 53. Assumption 5
      5 As tasks finish in waves, a task with a low progress score is most likely a straggler
      • Breaks down because uneven workloads spread task completions out over time
  • 55. Assumption 6
      6 Tasks within the same phase require roughly the same amount of work
      • Breaks down due to data skew
  • 56. Outline: section 4, Low-level Programming Interface
  • 60. Introduction
      • The one-input, two-stage data flow is extremely rigid for ad-hoc analysis of large datasets
      • Hacks need to be put into place for any other data flow, such as joins or multiple stages
      • Custom code has to be written for common DB operations, such as projection and filtering
      • The opaque nature of map and reduce functions makes it impossible for the framework to perform optimizations, such as operator reordering
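To illustrate the custom-code point, here is a sketch of what a simple filter plus projection looks like when written by hand as a map function. It corresponds roughly to the query `SELECT name FROM users WHERE age > 30`. The record layout, the sample data, and the helper `run_map_only` (a stand-in for a map-only job) are all hypothetical.

```python
def select_map(record_id, record):
    """Hand-written filter (age > 30) and projection (keep only the name)."""
    if record["age"] > 30:
        yield record["name"], None

def run_map_only(records, map_fn):
    """Apply an opaque map function to every record and collect its output."""
    output = []
    for record_id, record in enumerate(records):
        output.extend(map_fn(record_id, record))
    return output

users = [
    {"name": "Ann", "age": 28, "city": "Lahore"},
    {"name": "Bob", "age": 41, "city": "Karachi"},
    {"name": "Cy", "age": 30, "city": "Multan"},
    {"name": "Dana", "age": 35, "city": "Quetta"},
]
names = [key for key, _ in run_map_only(users, select_map)]
print(names)
```

From the point of view of `run_map_only`, `select_map` is just a function: nothing tells it that the function is a filter followed by a projection, which is why it cannot reorder or fuse such operators the way a query optimizer could.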
  • 61. Outline: section 5, Strictly Batch-processing
  • 65. Introduction
      • In MapReduce, the entire output of a map or a reduce task needs to be materialized to local storage before the next stage can commence
      • This simplifies fault tolerance
      • Reducers have to pull their input instead of the mappers pushing it
      • The design rules out pipelining, result estimation, and continuous queries (stream processing)
  • 66. Outline: section 6, Single-input/single output and Two-phase
  • 68. Introduction
      1 Not all applications can be broken down into just two phases, e.g. complex SQL-like queries
      2 Tasks take in just one input and produce one output
  • 69. Outline: section 7, Iterative and Recursive Applications
  • 74. Introduction
      1 Hadoop is widely employed for iterative computations
      2 For machine learning applications, the Apache Mahout library is used atop Hadoop
      3 Mahout uses an external driver program to submit multiple jobs to Hadoop and perform a convergence test
      4 No fault tolerance for the driver, plus job-submission overhead for every iteration
      5 Loop-invariant data is materialized to storage in every iteration
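The external-driver pattern described above can be sketched as follows. This is not Mahout's code: it is a toy one-dimensional k-means in which `run_job` stands in for one submitted MapReduce job and `driver` plays the role of the external program that resubmits jobs and tests for convergence. All names, the data, and the tolerance are assumptions made for illustration.

```python
from collections import defaultdict

def run_job(points, centroids):
    """One MapReduce-style job: assign points to centroids, then average."""
    groups = defaultdict(list)
    reads = 0
    for x in points:  # map: emit (index of nearest centroid, point)
        reads += 1
        nearest = min(range(len(centroids)), key=lambda i: abs(x - centroids[i]))
        groups[nearest].append(x)
    # reduce: new centroid is the mean of its points (unchanged if it got none)
    new_centroids = [
        sum(groups[i]) / len(groups[i]) if groups[i] else centroids[i]
        for i in range(len(centroids))
    ]
    return new_centroids, reads

def driver(points, centroids, tolerance=1e-9, max_jobs=20):
    """External driver: submit jobs until the centroids stop moving."""
    jobs = 0
    total_reads = 0
    while jobs < max_jobs:
        new_centroids, reads = run_job(points, centroids)
        jobs += 1
        total_reads += reads
        shift = max(abs(a - b) for a, b in zip(new_centroids, centroids))
        centroids = new_centroids
        if shift <= tolerance:  # convergence test lives outside the framework
            break
    return centroids, jobs, total_reads

points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centroids, jobs, total_reads = driver(points, [0.0, 5.0])
print(centroids, jobs, total_reads)
```

The input points never change between iterations, yet every job reads all of them again: three jobs over six points cost eighteen reads here. On a real cluster each of those jobs would also pay the submission and scheduling overhead.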
  • 75. Outline: section 8, Incremental Computation
  • 78. Introduction
      1 Most workloads processed by MapReduce are incremental by nature, i.e. MapReduce jobs often run repeatedly with small changes in their input
      2 For instance, most iterations of PageRank run with very small modifications to their input
      3 Unfortunately, even with a small change in input, MapReduce re-performs the entire computation
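One way to avoid recomputing everything is to memoize map output per input split, so that only changed splits are processed again. The sketch below shows that idea for word count. It is a simplified illustration and should not be read as the design of any particular system: the class `IncrementalWordCount`, the content-hash cache key, and the in-memory cache are all assumptions.

```python
import hashlib
from collections import Counter

class IncrementalWordCount:
    """Word count that caches map output per input split, keyed by content hash."""

    def __init__(self):
        self.cache = {}      # content hash -> per-split word counts
        self.map_calls = 0   # number of splits actually (re)processed

    def _map_split(self, split):
        key = hashlib.sha256(split.encode("utf-8")).hexdigest()
        if key not in self.cache:
            self.map_calls += 1
            self.cache[key] = Counter(split.split())
        return self.cache[key]

    def run(self, splits):
        """Merge the per-split counts; the merge itself is redone on every run."""
        total = Counter()
        for split in splits:
            total.update(self._map_split(split))
        return total

job = IncrementalWordCount()
first = job.run(["a b a", "c d", "e e e"])
calls_after_first = job.map_calls

# Second run: only the middle split has changed.
second = job.run(["a b a", "c d d", "e e e"])
calls_after_second = job.map_calls
print(calls_after_first, calls_after_second)
```

The first run maps all three splits, while the second maps only the modified one. Plain MapReduce has no such cache, so it would map all three splits again.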
  • 79. References
      1 MapReduce: A major step backwards. http://homes.cs.washington.edu/~billhowe/mapreduce_a_major_step_backwards.html
      2 Matei Zaharia, Andy Konwinski, Anthony D. Joseph, Randy Katz, and Ion Stoica. 2008. Improving MapReduce performance in heterogeneous environments. In Proceedings of the 8th USENIX conference on Operating systems design and implementation (OSDI ’08). USENIX Association, Berkeley, CA, USA, 29-42.
      3 Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, and Andrew Tomkins. 2008. Pig latin: a not-so-foreign language for data processing. In Proceedings of the 2008 ACM SIGMOD international conference on Management of data (SIGMOD ’08). ACM, New York, NY, USA, 1099-1110.
  • 80. References (2)
      4 Tyson Condie, Neil Conway, Peter Alvaro, Joseph M. Hellerstein, Khaled Elmeleegy, and Russell Sears. 2010. MapReduce online. In Proceedings of the 7th USENIX conference on Networked systems design and implementation (NSDI ’10). USENIX Association, Berkeley, CA, USA.
      5 Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis Fetterly. 2007. Dryad: distributed data-parallel programs from sequential building blocks. In Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007 (EuroSys ’07). ACM, New York, NY, USA, 59-72.
      6 Derek G. Murray, Malte Schwarzkopf, Christopher Smowton, Steven Smith, Anil Madhavapeddy, and Steven Hand. 2011. CIEL: a universal execution engine for distributed data-flow computing. In Proceedings of the 8th USENIX conference on Networked systems design and implementation (NSDI ’11). USENIX Association, Berkeley, CA, USA.
  • 81. References (3)
      7 Pramod Bhatotia, Alexander Wieder, Rodrigo Rodrigues, Umut A. Acar, and Rafael Pasquin. 2011. Incoop: MapReduce for incremental computations. In Proceedings of the 2nd ACM Symposium on Cloud Computing (SOCC ’11). ACM, New York, NY, USA.