5: MapReduce Theory and Implementation

Zubair Nabi
zubair.nabi@itu.edu.pk

April 18, 2013
Outline

  1 Introduction
  2 Programming Model
  3 Implementation
  4 Refinements
  5 Hadoop
Introduction
Common computations at Google

  Process large amounts of data generated from crawled documents, web request
  logs, etc.
  Compute inverted indices, the graph structure of web documents, summaries of
  pages crawled per host, etc.
  Common properties:
    1 The computation is conceptually simple and is distributed across hundreds
      or thousands of machines to leverage parallelism
    2 The input data is large
    3 The original simple computation is made complex by system-level code that
      handles work assignment, distribution, and fault tolerance
Enter MapReduce

  Based on the insights on the previous slide, two Google engineers, Jeff Dean
  and Sanjay Ghemawat, designed MapReduce in 2004
    An abstraction that helps the programmer express simple computations
    Hides the gory details of parallelization, fault tolerance, data
    distribution, and load balancing
    Relies on user-provided map and reduce functions, inspired by the
    primitives of the same name in functional languages
  Leverages one key insight: most of the computation at Google involved
  applying a map operation to each logical record in the input dataset to
  obtain a set of intermediate key/value pairs, and then applying a reduce
  operation to all values with the same key, for aggregation
Programming Model

  Input: a set of key/value pairs
  Output: a set of key/value pairs
  The user provides the entire computation in the form of two functions:
  map and reduce
User-defined functions

  1 Map
      Takes an input pair and produces a set of intermediate key/value pairs
      The framework groups the intermediate values by key for consumption by
      the reduce function
  2 Reduce
      Takes as input a key and a list of associated values
      In the common case, it merges these values into a smaller set of values
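
To make the division of labour concrete, here is a minimal, single-machine sketch of the model in Python. It is not the Google implementation; run_mapreduce, map_fn, and reduce_fn are illustrative names. The driver applies the user-supplied map function to every input pair, groups the intermediate values by key (the framework's job), and then hands each key and its list of values to the user-supplied reduce function.

    from collections import defaultdict

    def run_mapreduce(map_fn, reduce_fn, inputs):
        """Toy in-memory driver: map every input pair, group by key, then reduce."""
        intermediate = defaultdict(list)
        # Map phase: map_fn yields intermediate (key, value) pairs for each input pair.
        for key, value in inputs:
            for out_key, out_value in map_fn(key, value):
                intermediate[out_key].append(out_value)
        # Grouping by key happens here, in the framework, not in user code.
        output = []
        # Reduce phase: reduce_fn sees one key and the list of all its values.
        for out_key, values in intermediate.items():
            output.extend(reduce_fn(out_key, values))
        return output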
Example: Word Count

  Counting the occurrences of each word in a large collection of documents
  1 Map
      Emits each word along with the value 1
  2 Reduce
      Sums together all counts emitted for a particular word
Example: Word Count (2)

  map(String key, String value):
    // key: document name
    // value: document contents
    for each word w in value:
      EmitIntermediate(w, "1");

  reduce(String key, Iterator values):
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values:
      result += ParseInt(v);
    Emit(AsString(result));
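
The same word-count logic as a runnable Python sketch (hypothetical function names, not Google's API): the map function emits (word, 1) pairs, the reduce function sums the counts for a word, and a few driver lines stand in for the framework's grouping step.

    from collections import defaultdict

    def word_count_map(doc_name, contents):
        # key: document name, value: document contents
        for word in contents.split():
            yield word, 1

    def word_count_reduce(word, counts):
        # key: a word, values: a list of counts
        yield word, sum(counts)

    if __name__ == "__main__":
        documents = [("doc1", "the quick brown fox"), ("doc2", "the lazy dog")]
        grouped = defaultdict(list)
        for name, text in documents:
            for word, count in word_count_map(name, text):
                grouped[word].append(count)   # framework-style group-by-key
        for word in sorted(grouped):
            for w, total in word_count_reduce(word, grouped[word]):
                print(w, total)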
Types

  User-supplied map and reduce functions have associated types
  1 Map
      map(k1, v1) → list(k2, v2)
  2 Reduce
      reduce(k2, list(v2)) → list(v2)
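
The same types can be written down as Python type hints (an illustration only; the concrete key and value types are whatever a particular job uses, e.g. str and int for word count):

    from typing import Callable, Iterable, List, Tuple, TypeVar

    K1 = TypeVar("K1")
    V1 = TypeVar("V1")
    K2 = TypeVar("K2")
    V2 = TypeVar("V2")

    # map(k1, v1) -> list(k2, v2)
    MapFn = Callable[[K1, V1], Iterable[Tuple[K2, V2]]]

    # reduce(k2, list(v2)) -> list(v2)
    ReduceFn = Callable[[K2, List[V2]], Iterable[V2]]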
More applications

  Distributed Grep
    1 Map
        Emits a line if it matches a user-provided pattern
    2 Reduce
        Identity function
  Count of URL Access Frequency
    1 Map
        Similar to the Word Count map, but with URLs instead of words
    2 Reduce
        Similar to the Word Count reduce
More applications (2)

  Inverted Index
    1 Map
        Emits a sequence of <word, document_ID> pairs
    2 Reduce
        Emits <word, list(document_ID)>
  Distributed Sort
    1 Map
        Identity
    2 Reduce
        Identity
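
A short Python sketch of the inverted-index pair above (illustrative names; the framework is assumed to group the (word, document_ID) pairs by word before reduce runs):

    def inverted_index_map(doc_id, contents):
        # Emit a (word, document_ID) pair for every distinct word in the document.
        for word in set(contents.split()):
            yield word, doc_id

    def inverted_index_reduce(word, doc_ids):
        # Emit the word together with the sorted list of documents that contain it.
        yield word, sorted(set(doc_ids))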
Implementation
Cluster architecture

  A large cluster of shared-nothing commodity machines connected via Ethernet
  Each node is an x86 system running Linux, with local memory
  Commodity networking hardware, connected in a tree topology
  Because clusters consist of hundreds or thousands of machines, failures are
  common
  Each machine has local hard drives
    The Google File System (GFS) runs atop these disks and employs replication
    to ensure availability and reliability
  Jobs are submitted to a scheduler, which maps the tasks within each job to
  available machines in the cluster
MapReduce architecture

  1 Master: in charge of all metadata, work scheduling and distribution, and
    job orchestration
  2 Workers: contain slots in which map and reduce tasks execute
Execution

  1 The user writes the map and reduce functions and stitches together a
    MapReduce specification with the location of the input dataset, the number
    of reduce tasks, and other attributes
  2 The master logically splits the input dataset into M splits, where
    M = input_dataset_size / GFS_block_size
      The GFS block size is typically a multiple of 64 MB
  3 It then earmarks M map tasks and assigns them to workers. Each worker has
    a configurable number of task slots. Each time a worker completes a task,
    the master assigns it more pending map tasks
  4 Once all map tasks have completed, the master assigns R reduce tasks to
    worker nodes
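
As a back-of-the-envelope example of step 2: with a 1 TB input and 64 MB splits, M comes out to 16,384 map tasks.

    import math

    input_dataset_size = 1 * 1024**4   # 1 TB, in bytes
    gfs_block_size = 64 * 1024**2      # 64 MB, in bytes

    # M = input dataset size / GFS block size, rounded up for a partial last split
    M = math.ceil(input_dataset_size / gfs_block_size)
    print(M)  # 16384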
Mappers

  1 A map worker reads the contents of the input split it has been assigned
  2 It parses the split into key/value pairs and invokes the user-defined map
    function on each pair
  3 The intermediate key/value pairs produced by the map logic are buffered in
    memory
  4 Once the buffered key/value pairs exceed a threshold, they are partitioned
    (using a partitioning function) into R partitions and written to local
    disk. The locations of the partitions are passed to the master
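
A simplified sketch of step 4, assuming an in-memory list of buffered (key, value) pairs, R reduce tasks, and the default hash partitioner; the real implementation writes each partition to a file on local disk and reports its location to the master.

    def partition(key, R):
        # Default partitioning function: hash(key) mod R. A production system
        # would use a deterministic hash so that re-executed tasks partition
        # identically; Python's built-in hash() of strings is salted per process.
        return hash(key) % R

    def spill(buffered_pairs, R):
        """Split buffered intermediate pairs into R partitions, one per reduce task."""
        partitions = [[] for _ in range(R)]
        for key, value in buffered_pairs:
            partitions[partition(key, R)].append((key, value))
        return partitions  # in reality: R files on local disk, locations sent to master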
Reducers

  1 A reduce worker gets the locations of its input partitions from the master
    and retrieves them with HTTP requests
  2 Once it has read all of its input, it sorts it by key so that all
    occurrences of the same key are grouped together
  3 It then invokes the user-defined reduce function for each key, passing it
    the key and its associated values
  4 The key/value pairs produced by the reduce logic are appended to a final
    output file, which is subsequently written to the distributed filesystem
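
A minimal sketch of the reduce-side sort-and-group step (assuming the fetched intermediate pairs fit in memory; the real system falls back to an external sort when they do not):

    from itertools import groupby
    from operator import itemgetter

    def run_reduce(fetched_pairs, reduce_fn):
        """Sort fetched (key, value) pairs by key, group equal keys, apply reduce_fn."""
        fetched_pairs.sort(key=itemgetter(0))        # group occurrences of the same key
        output = []
        for key, group in groupby(fetched_pairs, key=itemgetter(0)):
            values = [value for _, value in group]
            output.extend(reduce_fn(key, values))    # user-defined reduce
        return output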
Book-keeping by the Master

  The master holds metadata for all jobs running in the cluster
    For each map and reduce task, it stores the state (pending, in-progress,
    or completed) and, for in-progress tasks, the ID of the worker executing it
    It also stores the locations and sizes of the partitions produced by each
    map task
Fault-tolerance

  For large compute clusters, failures are the norm rather than the exception
  1 Worker:
      Each worker sends a periodic heartbeat signal to the master
      If the master does not receive a heartbeat from a worker within a certain
      amount of time, it marks the worker as failed
      In-progress map and reduce tasks are simply re-executed on other nodes.
      The same goes for completed map tasks, since their output is stored on
      local disk and is lost when the machine fails
      Completed reduce tasks are not re-executed, as their output resides on
      the distributed filesystem
  2 Master:
      The entire computation is marked as failed
      But it is simple to keep the master's state soft and re-spawn it
Locality

  Network bandwidth is a scarce resource in typical clusters
  GFS slices files into 64 MB blocks and stores three replicas of each block
  across the cluster
  The master exploits this information by scheduling a map task near its input
  data. The preference order is node-local, then rack/switch-local, then any
  node
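
The preference order can be pictured as a tiny scheduling helper (a sketch only; replica_nodes, idle_workers, and rack_of are hypothetical inputs, not part of the actual system):

    def pick_worker(replica_nodes, idle_workers, rack_of):
        """Prefer a node holding a replica, then a same-rack node, then any idle node."""
        replica_set = set(replica_nodes)
        replica_racks = {rack_of(node) for node in replica_nodes}
        for worker in idle_workers:                  # 1. node-local
            if worker in replica_set:
                return worker
        for worker in idle_workers:                  # 2. rack/switch-local
            if rack_of(worker) in replica_racks:
                return worker
        return next(iter(idle_workers), None)        # 3. any available worker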
Speculative re-execution

  Every now and then the entire computation is held up by a “straggler” task
  Stragglers can arise for a number of reasons, such as machine load, network
  traffic, and software/hardware bugs
  To deal with stragglers, the master speculatively re-executes slow tasks on
  other machines
  A task is marked as completed whenever either the primary or the backup
  execution finishes
Scalability

  Possible to run at multiple scales: from single nodes to datacenters with
  tens of thousands of nodes
  Nodes can be added or removed on the fly to scale up or down
Refinements
Partitioning

  By default MapReduce uses hash partitioning to partition the key space:
  hash(key) % R
  Optionally, the user can provide a custom partitioning function, say, to
  mitigate skew or to ensure that certain keys always end up at a particular
  reduce worker
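
For illustration, the default partitioner next to a custom one that keeps all URLs from the same host on the same reduce worker (an example of pinning certain keys to a particular reducer; the function names are illustrative):

    from urllib.parse import urlparse

    def default_partition(key, R):
        # hash(key) % R spreads keys roughly evenly over the R reduce tasks.
        return hash(key) % R

    def host_partition(url_key, R):
        # Route every URL from the same host to the same reduce task, so all of
        # a host's URLs end up in the same output partition.
        return hash(urlparse(url_key).netloc) % R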
Combiner function

  For reduce functions that are commutative and associative, the user can
  additionally provide a combiner function, which is applied to the output of
  each map task for local merging
  Typically, the reduce function itself is used as the combiner
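
For word count, summing is commutative and associative, so the reduce function can be reused as the combiner; below is a sketch of the local merging applied to one map task's buffered output (illustrative, in-memory):

    from collections import defaultdict

    def word_count_reduce(word, counts):   # usable both as reducer and as combiner
        yield word, sum(counts)

    def combine(map_output, combiner_fn):
        """Locally merge one map task's output before it is written to disk."""
        grouped = defaultdict(list)
        for key, value in map_output:
            grouped[key].append(value)
        combined = []
        for key, values in grouped.items():
            combined.extend((key, v) for v in combiner_fn(key, values))
        return combined

    # combine([("the", 1), ("the", 1), ("fox", 1)], word_count_reduce)
    # -> [("the", 2), ("fox", 1)]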
Input/output formats

  By default, the library supports a number of input/output formats
    For instance, text as input and key/value pairs as output
  Optionally, the user can specify custom input readers and output writers
    For instance, to read from or write to a database
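
Conceptually, an input reader just turns a data source into key/value pairs; the sketch below shows a text-style reader (key = byte offset, value = line) and a database reader built on Python's sqlite3. Neither is Google's or Hadoop's actual interface, and the database reader assumes the query returns a row id as its first column.

    import sqlite3

    def text_records(path):
        # Text input format: key = byte offset of the line, value = line contents.
        offset = 0
        with open(path, "rb") as f:
            for raw in f:
                yield offset, raw.decode("utf-8", errors="replace").rstrip("\n")
                offset += len(raw)

    def db_records(db_path, query):
        # Custom input reader: key = row id (first column), value = remaining columns.
        conn = sqlite3.connect(db_path)
        try:
            for row_id, *values in conn.execute(query):
                yield row_id, tuple(values)
        finally:
            conn.close()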
Hadoop

  Open-source implementation of MapReduce, originally developed by Doug
  Cutting (work that began around 2004 in the Nutch project and later moved
  to Yahoo!)
  Now a top-level Apache open-source project
  Implemented in Java (Google’s in-house implementation is in C++)
  Comes with an associated distributed filesystem, HDFS (a clone of GFS)
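
Beyond the native Java API, Hadoop Streaming lets map and reduce logic be written in any language that reads stdin and writes stdout; a hedged word-count sketch follows (mapper and reducer combined in one script for brevity). It would typically be submitted with the hadoop-streaming jar, passing this script as both the -mapper and the -reducer command; the exact jar path and job options depend on the installation.

    #!/usr/bin/env python3
    # Word count in the Hadoop Streaming style. Hadoop sorts the mapper output by
    # key before the reducer sees it, so equal words arrive on consecutive lines.
    import sys

    def mapper():
        for line in sys.stdin:
            for word in line.split():
                print(f"{word}\t1")

    def reducer():
        current, total = None, 0
        for line in sys.stdin:
            word, count = line.rstrip("\n").split("\t")
            if word != current:
                if current is not None:
                    print(f"{current}\t{total}")
                current, total = word, 0
            total += int(count)
        if current is not None:
            print(f"{current}\t{total}")

    if __name__ == "__main__":
        mapper() if sys.argv[1:] == ["map"] else reducer()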
References

  Jeffrey Dean and Sanjay Ghemawat. 2004. MapReduce: Simplified Data
  Processing on Large Clusters. In Proceedings of the 6th Symposium on
  Operating Systems Design & Implementation (OSDI ’04), Vol. 6. USENIX
  Association, Berkeley, CA, USA.

  • 9. Enter MapReduce Based on the insights mentioned in the previous slide, 2 Google Engineers, Jeff Dean and Sanjay Ghemawat, in 2004 designed MapReduce Zubair Nabi 5: MapReduce Theory and Implementation April 18, 2013 5 / 34
  • 10. Enter MapReduce Based on the insights mentioned in the previous slide, 2 Google Engineers, Jeff Dean and Sanjay Ghemawat, in 2004 designed MapReduce Abstraction that helps the programmer express simple computations Zubair Nabi 5: MapReduce Theory and Implementation April 18, 2013 5 / 34
  • 11. Enter MapReduce Based on the insights mentioned in the previous slide, 2 Google Engineers, Jeff Dean and Sanjay Ghemawat, in 2004 designed MapReduce Abstraction that helps the programmer express simple computations Hides the gory details of parallelization, fault-tolerance, data distribution, and load balancing Zubair Nabi 5: MapReduce Theory and Implementation April 18, 2013 5 / 34
  • 12. Enter MapReduce Based on the insights mentioned in the previous slide, 2 Google Engineers, Jeff Dean and Sanjay Ghemawat, in 2004 designed MapReduce Abstraction that helps the programmer express simple computations Hides the gory details of parallelization, fault-tolerance, data distribution, and load balancing Relies on user-provided map and reduce primitives present in functional languages Zubair Nabi 5: MapReduce Theory and Implementation April 18, 2013 5 / 34
  • 13. Enter MapReduce Based on the insights mentioned in the previous slide, 2 Google Engineers, Jeff Dean and Sanjay Ghemawat, in 2004 designed MapReduce Abstraction that helps the programmer express simple computations Hides the gory details of parallelization, fault-tolerance, data distribution, and load balancing Relies on user-provided map and reduce primitives present in functional languages Leverages one key insight: Most of the computation at Google involved applying a map operator to each logical record in the input dataset to obtain a set of intermediate key/value pairs and then applying a reduce operation to all values with the same key, for aggregation Zubair Nabi 5: MapReduce Theory and Implementation April 18, 2013 5 / 34
  • 14. Zubair Nabi 5: MapReduce Theory and Implementation April 18, 2013 6 / 34
  • 15. Outline 1 Introduction 2 Programming Model 3 Implementation 4 Refinements 5 Hadoop Zubair Nabi 5: MapReduce Theory and Implementation April 18, 2013 7 / 34
  • 16. Programming Model Input: Set of key/value pairs Zubair Nabi 5: MapReduce Theory and Implementation April 18, 2013 8 / 34
  • 17. Programming Model Input: Set of key/value pairs Output: Set of key/value pairs Zubair Nabi 5: MapReduce Theory and Implementation April 18, 2013 8 / 34
  • 18. Programming Model Input: Set of key/value pairs Output: Set of key/value pairs The user provides the entire computation in the form of two functions: map and reduce Zubair Nabi 5: MapReduce Theory and Implementation April 18, 2013 8 / 34
  • 19. User-defined functions 1 Map Takes an input pair and produces a set of intermediate key/value pairs Zubair Nabi 5: MapReduce Theory and Implementation April 18, 2013 9 / 34
  • 20. User-defined functions 1 Map Takes an input pair and produces a set of intermediate key/value pairs The framework groups together the intermediate values by key for consumption by the Reduce Zubair Nabi 5: MapReduce Theory and Implementation April 18, 2013 9 / 34
  • 21. User-defined functions 1 Map Takes an input pair and produces a set of intermediate key/value pairs The framework groups together the intermediate values by key for consumption by the Reduce 2 Reduce Takes as input a key and a list of associated values Zubair Nabi 5: MapReduce Theory and Implementation April 18, 2013 9 / 34
  • 22. User-defined functions 1 Map Takes an input pair and produces a set of intermediate key/value pairs The framework groups together the intermediate values by key for consumption by the Reduce 2 Reduce Takes as input a key and a list of associated values In the common case, it merges these values to result in a smaller set of values Zubair Nabi 5: MapReduce Theory and Implementation April 18, 2013 9 / 34
  • 23. Example: Word Count Counting the occurrence of each word in a large collection of documents Zubair Nabi 5: MapReduce Theory and Implementation April 18, 2013 10 / 34
  • 24. Example: Word Count Counting the occurrence of each word in a large collection of documents 1 Map Emits each word and the value 1 Zubair Nabi 5: MapReduce Theory and Implementation April 18, 2013 10 / 34
  • 25. Example: Word Count Counting the occurrence of each word in a large collection of documents 1 Map Emits each word and the value 1 2 Reduce Sums together all counts emitted for a particular word Zubair Nabi 5: MapReduce Theory and Implementation April 18, 2013 10 / 34
  • 26. Example: Word Count (2)

        map(String key, String value):
            // key: document name
            // value: document contents
            for each word w in value:
                EmitIntermediate(w, "1");

        reduce(String key, Iterator values):
            // key: a word
            // values: a list of counts
            int result = 0;
            for each v in values:
                result += ParseInt(v);
            Emit(AsString(result));

    Zubair Nabi 5: MapReduce Theory and Implementation April 18, 2013 11 / 34
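
The listing above is the pseudocode from the original MapReduce paper. As a rough sketch of the same computation against Hadoop's Java API (Hadoop itself is introduced at the end of this deck), the two functions might look as follows; the class names WordCountFunctions, TokenizerMapper, and IntSumReducer are illustrative choices, not part of the slides.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCountFunctions {

      // Map: for every input line, emit (word, 1) for each word it contains
      public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);        // EmitIntermediate(w, "1")
          }
        }
      }

      // Reduce: sum all counts emitted for a particular word
      public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();                // result += ParseInt(v)
          }
          result.set(sum);
          context.write(key, result);        // Emit(AsString(result))
        }
      }
    }
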
  • 27. Types User-supplied map and reduce functions have associated types 1 Map map(k1, v1) → list(k2, v2) Zubair Nabi 5: MapReduce Theory and Implementation April 18, 2013 12 / 34
  • 28. Types User-supplied map and reduce functions have associated types 1 Map map(k1, v1) → list(k2, v2) 2 Reduce reduce(k2, list(v2)) → list(v2) Zubair Nabi 5: MapReduce Theory and Implementation April 18, 2013 12 / 34
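
As an informal illustration of these signatures, the two types can be written as Java generics; the interface names below are invented for this sketch and do not belong to any real library.

    import java.util.List;
    import java.util.Map;

    // Hypothetical interfaces mirroring the signatures above
    interface MapFn<K1, V1, K2, V2> {
      List<Map.Entry<K2, V2>> map(K1 key, V1 value);   // map(k1, v1) -> list(k2, v2)
    }

    interface ReduceFn<K2, V2> {
      List<V2> reduce(K2 key, List<V2> values);        // reduce(k2, list(v2)) -> list(v2)
    }
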
  • 29. More applications Distributed Grep 1 Map Emits a line if it matches a user-provided pattern 2 Reduce Identity function Zubair Nabi 5: MapReduce Theory and Implementation April 18, 2013 13 / 34
  • 30. More applications Distributed Grep 1 Map Emits a line if it matches a user-provided pattern 2 Reduce Identity function Count of URL Access Frequency 1 Map Similar to Word Count map. Instead of words we have URLs 2 Reduce Similar to Word Count reduce Zubair Nabi 5: MapReduce Theory and Implementation April 18, 2013 13 / 34
  • 31. More applications (2) Inverted Index 1 Map Emits a sequence of < word, document_ID > 2 Reduce Emits < word, list(document_ID) > Zubair Nabi 5: MapReduce Theory and Implementation April 18, 2013 14 / 34
  • 32. More applications (2) Inverted Index 1 Map Emits a sequence of < word, document_ID > 2 Reduce Emits < word, list(document_ID) > Distributed Sort 1 Map Identity 2 Reduce Identity Zubair Nabi 5: MapReduce Theory and Implementation April 18, 2013 14 / 34
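
As a hedged sketch, the inverted-index pair of functions could be written against Hadoop's Java API as shown below; using the input file name as the document ID (obtained via FileSplit) is an assumption made here purely for illustration.

    import java.io.IOException;
    import java.util.LinkedHashSet;
    import java.util.Set;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    public class InvertedIndex {

      // Map: emit <word, document_ID>; the input file name stands in for the document ID
      public static class IndexMapper extends Mapper<Object, Text, Text, Text> {
        private final Text word = new Text();
        private final Text docId = new Text();

        @Override
        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          docId.set(((FileSplit) context.getInputSplit()).getPath().getName());
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, docId);
          }
        }
      }

      // Reduce: emit <word, list(document_ID)> as a comma-separated posting list
      public static class IndexReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
          Set<String> docs = new LinkedHashSet<>();
          for (Text v : values) {
            docs.add(v.toString());           // de-duplicate repeated document IDs
          }
          context.write(key, new Text(String.join(",", docs)));
        }
      }
    }
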
  • 33. Outline 1 Introduction 2 Programming Model 3 Implementation 4 Refinements 5 Hadoop Zubair Nabi 5: MapReduce Theory and Implementation April 18, 2013 15 / 34
  • 34. Cluster architecture A large cluster of shared-nothing commodity machines connected via Ethernet Each node is an x86 system running Linux with local memory Zubair Nabi 5: MapReduce Theory and Implementation April 18, 2013 16 / 34
  • 35. Cluster architecture A large cluster of shared-nothing commodity machines connected via Ethernet Each node is an x86 system running Linux with local memory Commodity networking hardware connected in the form of a tree topology Zubair Nabi 5: MapReduce Theory and Implementation April 18, 2013 16 / 34
  • 36. Cluster architecture A large cluster of shared-nothing commodity machines connected via Ethernet Each node is an x86 system running Linux with local memory Commodity networking hardware connected in the form of a tree topology As clusters consist of hundreds or thousands of machines, failure is pretty common Zubair Nabi 5: MapReduce Theory and Implementation April 18, 2013 16 / 34
  • 37. Cluster architecture A large cluster of shared-nothing commodity machines connected via Ethernet Each node is an x86 system running Linux with local memory Commodity networking hardware connected in the form of a tree topology As clusters consist of hundreds or thousands of machines, failure is pretty common Each machine has local hard drives Zubair Nabi 5: MapReduce Theory and Implementation April 18, 2013 16 / 34
  • 38. Cluster architecture A large cluster of shared-nothing commodity machines connected via Ethernet Each node is an x86 system running Linux with local memory Commodity networking hardware connected in the form of a tree topology As clusters consist of hundreds or thousands of machines, failure is pretty common Each machine has local hard drives The Google File System (GFS) runs atop these disks and employs replication to ensure availability and reliability Zubair Nabi 5: MapReduce Theory and Implementation April 18, 2013 16 / 34
  • 39. Cluster architecture A large cluster of shared-nothing commodity machines connected via Ethernet Each node is an x86 system running Linux with local memory Commodity networking hardware connected in the form of a tree topology As clusters consist of hundreds or thousands of machines, failure is pretty common Each machine has local hard drives The Google File System (GFS) runs atop these disks and employs replication to ensure availability and reliability Jobs are submitted to a scheduler, which maps tasks within that job to available machines within the cluster Zubair Nabi 5: MapReduce Theory and Implementation April 18, 2013 16 / 34
  • 40. MapReduce architecture 1 Master: In charge of all meta data, work scheduling and distribution, and job orchestration Zubair Nabi 5: MapReduce Theory and Implementation April 18, 2013 17 / 34
  • 41. MapReduce architecture 1 Master: In charge of all meta data, work scheduling and distribution, and job orchestration 2 Workers: Contain slots to execute map or reduce functions Zubair Nabi 5: MapReduce Theory and Implementation April 18, 2013 17 / 34
  • 42. Execution 1 The user writes map and reduce functions and stitches together a MapReduce specification with the location of the input dataset, number of reduce tasks, and other attributes Zubair Nabi 5: MapReduce Theory and Implementation April 18, 2013 18 / 34
  • 43. Execution 1 The user writes map and reduce functions and stitches together a MapReduce specification with the location of the input dataset, number of reduce tasks, and other attributes 2 The master logically splits the input dataset into M splits, where M = (Input_dataset_size)/(GFS_block_size) Zubair Nabi 5: MapReduce Theory and Implementation April 18, 2013 18 / 34
  • 44. Execution 1 The user writes map and reduce functions and stitches together a MapReduce specification with the location of the input dataset, number of reduce tasks, and other attributes 2 The master logically splits the input dataset into M splits, where M = (Input_dataset_size)/(GFS_block_size) The GFS block size is typically a multiple of 64MB Zubair Nabi 5: MapReduce Theory and Implementation April 18, 2013 18 / 34
  • 45. Execution 1 The user writes map and reduce functions and stitches together a MapReduce specification with the location of the input dataset, number of reduce tasks, and other attributes 2 The master logically splits the input dataset into M splits, where M = (Input_dataset_size)/(GFS_block_size) The GFS block size is typically a multiple of 64MB 3 It then earmarks M map tasks and assigns them to workers. Each worker has a configurable number of task slots. Each time a worker completes a task, the master assigns it more pending map tasks Zubair Nabi 5: MapReduce Theory and Implementation April 18, 2013 18 / 34
  • 46. Execution 1 The user writes map and reduce functions and stitches together a MapReduce specification with the location of the input dataset, number of reduce tasks, and other attributes 2 The master logically splits the input dataset into M splits, where M = (Input_dataset_size)/(GFS_block_size) The GFS block size is typically a multiple of 64MB 3 It then earmarks M map tasks and assigns them to workers. Each worker has a configurable number of task slots. Each time a worker completes a task, the master assigns it more pending map tasks 4 Once all map tasks have completed, the master assigns R reduce tasks to worker nodes Zubair Nabi 5: MapReduce Theory and Implementation April 18, 2013 18 / 34
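
A quick back-of-the-envelope check of the split count M, assuming a 1 TB input dataset and the 64 MB block size mentioned above (both values are just example numbers):

    // Back-of-the-envelope split count: M = ceil(input_dataset_size / GFS_block_size)
    public class SplitCount {
      public static void main(String[] args) {
        long inputBytes = 1L << 40;                            // assumed 1 TB input dataset
        long blockBytes = 64L << 20;                           // 64 MB block size
        long m = (inputBytes + blockBytes - 1) / blockBytes;   // ceiling division
        System.out.println("M (map tasks) = " + m);            // prints 16384
      }
    }
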
  • 47. Mappers 1 A map worker reads the contents of the input split that it has been assigned Zubair Nabi 5: MapReduce Theory and Implementation April 18, 2013 19 / 34
  • 48. Mappers 1 A map worker reads the contents of the input split that it has been assigned 2 It parses the file and converts it to key/value pairs and invokes the user-defined map function for each pair Zubair Nabi 5: MapReduce Theory and Implementation April 18, 2013 19 / 34
  • 49. Mappers 1 A map worker reads the contents of the input split that it has been assigned 2 It parses the file and converts it to key/value pairs and invokes the user-defined map function for each pair 3 The intermediate key/value pairs after the application of the map logic are collected (buffered) in memory Zubair Nabi 5: MapReduce Theory and Implementation April 18, 2013 19 / 34
  • 50. Mappers 1 A map worker reads the contents of the input split that it has been assigned 2 It parses the file and converts it to key/value pairs and invokes the user-defined map function for each pair 3 The intermediate key/value pairs after the application of the map logic are collected (buffered) in memory 4 Once the buffered key/value pairs exceed a threshold they are written to local disk and partitioned (using a partitioning function) into R partitions. The location of each partition is passed to the master Zubair Nabi 5: MapReduce Theory and Implementation April 18, 2013 19 / 34
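
A toy sketch (not the real worker code) of the buffering-and-partitioning step: intermediate pairs are collected in memory and assigned to one of R partitions with hash(key) % R; the spill to local disk and the notification to the master are only indicated in comments.

    import java.util.AbstractMap.SimpleEntry;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map.Entry;

    // Toy sketch of the map-side buffer
    public class MapSideBuffer {
      private final int numReducers;                                  // R
      private final List<List<Entry<String, String>>> partitions = new ArrayList<>();

      public MapSideBuffer(int numReducers) {
        this.numReducers = numReducers;
        for (int i = 0; i < numReducers; i++) {
          partitions.add(new ArrayList<>());
        }
      }

      // Called for every intermediate pair emitted by the user-defined map function
      public void emitIntermediate(String key, String value) {
        int p = (key.hashCode() & Integer.MAX_VALUE) % numReducers;   // hash(key) % R
        partitions.get(p).add(new SimpleEntry<>(key, value));
        // A real map worker spills the partitions to local disk once the buffer
        // exceeds a threshold and reports the partition locations to the master.
      }
    }
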
  • 51. Reducers 1 A reduce worker gets locations of its input partitions from the master and uses HTTP requests to retrieve them Zubair Nabi 5: MapReduce Theory and Implementation April 18, 2013 20 / 34
  • 52. Reducers 1 A reduce worker gets locations of its input partitions from the master and uses HTTP requests to retrieve them 2 Once it has read all its input, it sorts it by key to group together all occurrences of the same key Zubair Nabi 5: MapReduce Theory and Implementation April 18, 2013 20 / 34
  • 53. Reducers 1 A reduce worker gets locations of its input partitions from the master and uses HTTP requests to retrieve them 2 Once it has read all its input, it sorts it by key to group together all occurrences of the same key 3 It then invokes the user-defined reduce for each key and passes it the key and its associated values Zubair Nabi 5: MapReduce Theory and Implementation April 18, 2013 20 / 34
  • 54. Reducers 1 A reduce worker gets locations of its input partitions from the master and uses HTTP requests to retrieve them 2 Once it has read all its input, it sorts it by key to group together all occurrences of the same key 3 It then invokes the user-defined reduce for each key and passes it the key and its associated values 4 The key/value pairs generated after the application of the reduce logic are then written to a final output file, which is subsequently written to the distributed filesystem Zubair Nabi 5: MapReduce Theory and Implementation April 18, 2013 20 / 34
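
A toy sketch of the reduce-side grouping: fetched pairs are sorted by key, values with the same key are grouped, and the user-defined reduce is invoked once per key. The ReduceFn interface and method names are invented for illustration.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;
    import java.util.TreeMap;

    // Toy sketch of the reduce side: sort, group, then apply the reduce logic
    public class ReduceSideGrouping {

      public interface ReduceFn {
        String reduce(String key, List<String> values);
      }

      public static List<String> runReduce(List<String[]> fetchedPairs, ReduceFn reduce) {
        // TreeMap keeps keys sorted, which groups all occurrences of the same key
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String[] kv : fetchedPairs) {
          grouped.computeIfAbsent(kv[0], k -> new ArrayList<>()).add(kv[1]);
        }
        List<String> output = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : grouped.entrySet()) {
          output.add(e.getKey() + "\t" + reduce.reduce(e.getKey(), e.getValue()));
        }
        return output;   // a real reduce worker writes this to the distributed filesystem
      }
    }
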
  • 55. Zubair Nabi 5: MapReduce Theory and Implementation April 18, 2013 21 / 34
  • 56. Book-keeping by the Master The master contains meta-data for all jobs running in the cluster Zubair Nabi 5: MapReduce Theory and Implementation April 18, 2013 22 / 34
  • 57. Book-keeping by the Master The master contains meta-data for all jobs running in the cluster For each map and reduce task, it stores the state (pending, in-progress, or completed) and the ID of the worker on which it is executing (in-progress state) Zubair Nabi 5: MapReduce Theory and Implementation April 18, 2013 22 / 34
  • 58. Book-keeping by the Master The master contains meta-data for all jobs running in the cluster For each map and reduce task, it stores the state (pending, in-progress, or completed) and the ID of the worker on which it is executing (in-progress state) It stores the locations and sizes of partitions for each map task Zubair Nabi 5: MapReduce Theory and Implementation April 18, 2013 22 / 34
  • 59. Fault-tolerance For large compute clusters, failures are the norm rather than the exception Zubair Nabi 5: MapReduce Theory and Implementation April 18, 2013 23 / 34
  • 60. Fault-tolerance For large compute clusters, failures are the norm rather than the exception 1 Worker: Each worker sends a periodic heartbeat signal to the master Zubair Nabi 5: MapReduce Theory and Implementation April 18, 2013 23 / 34
  • 61. Fault-tolerance For large compute clusters, failures are the norm rather than the exception 1 Worker: Each worker sends a periodic heartbeat signal to the master If the master does not receive a heartbeat from a worker in a certain amount of time, it marks the worker as failed Zubair Nabi 5: MapReduce Theory and Implementation April 18, 2013 23 / 34
  • 62. Fault-tolerance For large compute clusters, failures are the norm rather than the exception 1 Worker: Each worker sends a periodic heartbeat signal to the master If the master does not receive a heartbeat from a worker in a certain amount of time, it marks the worker as failed In-progress map and reduce tasks are simply re-executed on other nodes. Same goes for completed map tasks (as their output is lost on machine failure) Zubair Nabi 5: MapReduce Theory and Implementation April 18, 2013 23 / 34
  • 63. Fault-tolerance For large compute clusters, failures are the norm rather than the exception 1 Worker: Each worker sends a periodic heartbeat signal to the master If the master does not receive a heartbeat from a worker in a certain amount of time, it marks the worker as failed In-progress map and reduce tasks are simply re-executed on other nodes. Same goes for completed map tasks (as their output is lost on machine failure) Completed reduce tasks are not re-executed as their output resides on the distributed filesystem Zubair Nabi 5: MapReduce Theory and Implementation April 18, 2013 23 / 34
  • 64. Fault-tolerance For large compute clusters, failures are the norm rather than the exception 1 Worker: Each worker sends a periodic heartbeat signal to the master If the master does not receive a heartbeat from a worker in a certain amount of time, it marks the worker as failed In-progress map and reduce tasks are simply re-executed on other nodes. Same goes for completed map tasks (as their output is lost on machine failure) Completed reduce tasks are not re-executed as their output resides on the distributed filesystem 2 Master: The entire computation is marked as failed Zubair Nabi 5: MapReduce Theory and Implementation April 18, 2013 23 / 34
  • 65. Fault-tolerance For large compute clusters, failures are the norm rather than the exception 1 Worker: Each worker sends a periodic heartbeat signal to the master If the master does not receive a heartbeat from a worker in a certain amount of time, it marks the worker as failed In-progress map and reduce tasks are simply re-executed on other nodes. Same goes for completed map tasks (as their output is lost on machine failure) Completed reduce tasks are not re-executed as their output resides on the distributed filesystem 2 Master: The entire computation is marked as failed But it is simple to keep the master state as soft state and re-spawn it Zubair Nabi 5: MapReduce Theory and Implementation April 18, 2013 23 / 34
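
A minimal sketch of heartbeat-based failure detection, assuming a fixed (invented) timeout value; a real master would additionally move the failed worker's in-progress tasks, and its completed map tasks, back to the pending state for re-execution.

    import java.util.HashMap;
    import java.util.Map;

    // Minimal failure detector: a worker that has not sent a heartbeat within
    // the timeout is treated as failed, so its tasks can be rescheduled elsewhere.
    public class HeartbeatMonitor {
      private static final long TIMEOUT_MS = 10_000;   // assumed timeout value
      private final Map<String, Long> lastHeartbeat = new HashMap<>();

      public void onHeartbeat(String workerId, long nowMs) {
        lastHeartbeat.put(workerId, nowMs);
      }

      public boolean isFailed(String workerId, long nowMs) {
        Long last = lastHeartbeat.get(workerId);
        return last == null || nowMs - last > TIMEOUT_MS;
      }
    }
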
  • 66. Locality Network bandwidth is a scarce resource in typical clusters Zubair Nabi 5: MapReduce Theory and Implementation April 18, 2013 24 / 34
  • 67. Locality Network bandwidth is a scarce resource in typical clusters GFS slices files into 64MB blocks and stores 3 replicas across the cluster Zubair Nabi 5: MapReduce Theory and Implementation April 18, 2013 24 / 34
  • 68. Locality Network bandwidth is a scarce resource in typical clusters GFS slices files into 64MB blocks and stores 3 replicas across the cluster The master exploits this information by scheduling a map task near its input data. Preference is in the order: node-local, rack/switch-local, and any Zubair Nabi 5: MapReduce Theory and Implementation April 18, 2013 24 / 34
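
A small sketch of that placement preference, assuming the master knows the hosts and racks holding each input block's replicas (as reported by GFS/HDFS); the method and class names are invented for illustration.

    import java.util.List;

    // Node-local beats rack/switch-local, which beats "any"
    public class LocalityPreference {
      public static int score(String workerHost, String workerRack,
                              List<String> replicaHosts, List<String> replicaRacks) {
        if (replicaHosts.contains(workerHost)) return 0;   // node-local: no network transfer
        if (replicaRacks.contains(workerRack)) return 1;   // rack/switch-local
        return 2;                                          // any: data crosses the core network
      }
    }
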
  • 69. Speculative re-execution Every now and then the entire computation is held up by a “straggler” task Zubair Nabi 5: MapReduce Theory and Implementation April 18, 2013 25 / 34
  • 70. Speculative re-execution Every now and then the entire computation is held up by a “straggler” task Stragglers can arise due to a number of reasons, such as machine load, network traffic, software/hardware bugs, etc. Zubair Nabi 5: MapReduce Theory and Implementation April 18, 2013 25 / 34
  • 71. Speculative re-execution Every now and then the entire computation is held up by a “straggler” task Stragglers can arise due to a number of reasons, such as machine load, network traffic, software/hardware bugs, etc. To deal with stragglers, the master speculatively re-executes slow tasks on other machines Zubair Nabi 5: MapReduce Theory and Implementation April 18, 2013 25 / 34
  • 72. Speculative re-execution Every now and then the entire computation is held up by a “straggler” task Stragglers can arise due to a number of reasons, such as machine load, network traffic, software/hardware bugs, etc. To deal with stragglers, the master speculatively re-executes slow tasks on other machines The task is marked as completed whenever the primary or the backup finishes its execution Zubair Nabi 5: MapReduce Theory and Implementation April 18, 2013 25 / 34
  • 73. Scalability Possible to run on multiple scales: from single nodes to data centers with tens of thousands of nodes Zubair Nabi 5: MapReduce Theory and Implementation April 18, 2013 26 / 34
  • 74. Scalability Possible to run on multiple scales: from single nodes to data centers with tens of thousands of nodes Nodes can be added/removed on the fly to scale up/down Zubair Nabi 5: MapReduce Theory and Implementation April 18, 2013 26 / 34
  • 75. Outline 1 Introduction 2 Programming Model 3 Implementation 4 Refinements 5 Hadoop Zubair Nabi 5: MapReduce Theory and Implementation April 18, 2013 27 / 34
  • 76. Partitioning By default MapReduce uses hash partitioning to partition the key space hash(key) % R Zubair Nabi 5: MapReduce Theory and Implementation April 18, 2013 28 / 34
  • 77. Partitioning By default MapReduce uses hash partitioning to partition the key space hash(key) % R Optionally, the user can provide a custom partitioning function to, say, counteract skew or to ensure that certain keys always end up at a particular reduce worker Zubair Nabi 5: MapReduce Theory and Implementation April 18, 2013 28 / 34
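
As a hedged example against Hadoop's Java API, a custom partitioner that sends all URLs of the same host to one reduce worker might look like this; the host-extraction logic is a simplification for illustration, and the class would be registered on the job with job.setPartitionerClass(HostPartitioner.class).

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Illustrative custom partitioner: partition URL keys by hostname so that
    // all pages of one host reach the same reduce worker
    public class HostPartitioner extends Partitioner<Text, IntWritable> {
      @Override
      public int getPartition(Text key, IntWritable value, int numPartitions) {
        String url = key.toString();
        int schemeEnd = url.indexOf("://");
        int hostStart = (schemeEnd >= 0) ? schemeEnd + 3 : 0;
        int hostEnd = url.indexOf('/', hostStart);
        String host = (hostEnd >= 0) ? url.substring(hostStart, hostEnd)
                                     : url.substring(hostStart);
        return (host.hashCode() & Integer.MAX_VALUE) % numPartitions;   // hash(host) % R
      }
    }
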
  • 78. Combiner function For reduce functions which are commutative and associative, the user can additionally provide a combiner function which is applied to the output of the map for local merging Zubair Nabi 5: MapReduce Theory and Implementation April 18, 2013 29 / 34
  • 79. Combiner function For reduce functions which are commutative and associative, the user can additionally provide a combiner function which is applied to the output of the map for local merging Typically, the same reduce function is used as a combiner Zubair Nabi 5: MapReduce Theory and Implementation April 18, 2013 29 / 34
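
In Hadoop, for instance, the combiner is simply registered on the job; reusing the word-count reducer sketched earlier is valid because integer summation is commutative and associative (WordCountFunctions is the illustrative class from that sketch, not part of the slides).

    import org.apache.hadoop.mapreduce.Job;

    // The word-count reducer doubles as the combiner for local merging
    public class CombinerSetup {
      public static void configure(Job job) {
        job.setCombinerClass(WordCountFunctions.IntSumReducer.class);
      }
    }
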
  • 80. Input/output formats By default, the library supports a number of input/output formats Zubair Nabi 5: MapReduce Theory and Implementation April 18, 2013 30 / 34
  • 81. Input/output formats By default, the library supports a number of input/output formats For instance, text as input and key/value pairs as output Zubair Nabi 5: MapReduce Theory and Implementation April 18, 2013 30 / 34
  • 82. Input/output formats By default, the library supports a number of input/output formats For instance, text as input and key/value pairs as output Optionally, the user can specify custom input readers and output writers Zubair Nabi 5: MapReduce Theory and Implementation April 18, 2013 30 / 34
  • 83. Input/output formats By default, the library supports a number of input/output formats For instance, text as input and key/value pairs as output Optionally, the user can specify custom input readers and output writers For instance, to read/write from/to a database Zubair Nabi 5: MapReduce Theory and Implementation April 18, 2013 30 / 34
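
In Hadoop terms, for example, input and output formats are swapped by registering different classes on the job; a custom InputFormat/RecordReader (e.g., one that reads from a database) plugs in the same way. The helper class below is illustrative, not a Hadoop API.

    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

    // Illustrative helper that picks built-in input/output formats
    public class FormatSetup {
      public static void configure(Job job) {
        job.setInputFormatClass(KeyValueTextInputFormat.class);   // splits each line at the first tab
        job.setOutputFormatClass(TextOutputFormat.class);         // writes "key<TAB>value" lines
      }
    }
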
  • 84. Zubair Nabi 5: MapReduce Theory and Implementation April 18, 2013 31 / 34
  • 85. Outline 1 Introduction 2 Programming Model 3 Implementation 4 Refinements 5 Hadoop Zubair Nabi 5: MapReduce Theory and Implementation April 18, 2013 32 / 34
  • 86. Hadoop Open-source implementation of MapReduce, originally developed by Doug Cutting starting in 2004 and later backed by Yahoo! Zubair Nabi 5: MapReduce Theory and Implementation April 18, 2013 33 / 34
  • 87. Hadoop Open-source implementation of MapReduce, originally developed by Doug Cutting starting in 2004 and later backed by Yahoo! Now a top-level Apache open-source project Zubair Nabi 5: MapReduce Theory and Implementation April 18, 2013 33 / 34
  • 88. Hadoop Open-source implementation of MapReduce, originally developed by Doug Cutting starting in 2004 and later backed by Yahoo! Now a top-level Apache open-source project Implemented in Java (Google’s in-house implementation is in C++) Zubair Nabi 5: MapReduce Theory and Implementation April 18, 2013 33 / 34
  • 89. Hadoop Open-source implementation of MapReduce, originally developed by Doug Cutting starting in 2004 and later backed by Yahoo! Now a top-level Apache open-source project Implemented in Java (Google’s in-house implementation is in C++) Comes with an associated distributed filesystem, HDFS (clone of GFS) Zubair Nabi 5: MapReduce Theory and Implementation April 18, 2013 33 / 34
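
Tying the earlier word-count sketch together, a minimal Hadoop driver might look as follows; it assumes a Hadoop 2.x-style API and the illustrative WordCountFunctions class from the earlier sketch, with input and output HDFS paths passed on the command line.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Minimal driver: configures the job and submits it to the cluster
    public class WordCountDriver {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountFunctions.TokenizerMapper.class);
        job.setReducerClass(WordCountFunctions.IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));     // input directory on HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));   // output directory on HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }
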
  • 90. References Jeffrey Dean and Sanjay Ghemawat. 2004. MapReduce: Simplified data processing on large clusters. In Proceedings of the 6th Symposium on Operating Systems Design & Implementation (OSDI ’04), Vol. 6. USENIX Association, Berkeley, CA, USA. Zubair Nabi 5: MapReduce Theory and Implementation April 18, 2013 34 / 34