Topic 5: MapReduce Theory and Implementation
Zubair Nabi
zubair.nabi@itu.edu.pk
April 18, 2013
Zubair Nabi 5: MapReduce Theory and Implementation April 18, 2013 1 / 34
Outline
1 Introduction
2 Programming Model
3 Implementation
4 Refinements
5 Hadoop
Common computations at Google
Process large amounts of data generated from crawled documents, web request logs, etc.
Compute inverted indices, the graph structure of web documents, summaries of pages crawled per host, etc.
Common properties:
1 The computation is conceptually simple and is distributed across hundreds or thousands of machines to leverage parallelism
2 The input data is large
3 The originally simple computation is made complex by system-level code that deals with work assignment and distribution, and fault-tolerance
Enter MapReduce
Based on these insights, two Google engineers, Jeff Dean and Sanjay Ghemawat, designed MapReduce in 2004
An abstraction that helps the programmer express simple computations
Hides the gory details of parallelization, fault-tolerance, data distribution, and load balancing
Relies on user-provided map and reduce primitives present in functional languages
Leverages one key insight: most of the computation at Google involved applying a map operation to each logical record in the input dataset to obtain a set of intermediate key/value pairs, and then applying a reduce operation to all values with the same key, for aggregation
Outline
1 Introduction
2 Programming Model
3 Implementation
4 Refinements
5 Hadoop
Programming Model
Input: Set of key/value pairs
Output: Set of key/value pairs
The user provides the entire computation in the form of two functions: map and reduce
User-defined functions
1 Map
Takes an input pair and produces a set of intermediate key/value pairs
The framework groups together the intermediate values by key for consumption by the reduce function
2 Reduce
Takes as input a key and a list of associated values
In the common case, it merges these values to produce a smaller set of values
Example: Word Count
Counting the occurrence of each word in a large collection of documents
1 Map
Emits each word and the value 1
2 Reduce
Sums together all counts emitted for a particular word
Example: Word Count (2)

map(String key, String value):
    // key: document name
    // value: document contents
    for each word w in value:
        EmitIntermediate(w, "1");

reduce(String key, Iterator values):
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values:
        result += ParseInt(v);
    Emit(AsString(result));
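The pseudocode above can be exercised with a minimal single-process simulation. The sketch below is illustrative only (the function and variable names are invented here, not part of any real MapReduce API); it runs the map, shuffle, and reduce phases in memory:

```python
from collections import defaultdict

def map_fn(doc_name, contents):
    # Map: emit (word, 1) for every word in the document
    for word in contents.split():
        yield word, 1

def reduce_fn(word, counts):
    # Reduce: sum all counts emitted for this word
    yield word, sum(counts)

def run_mapreduce(inputs, map_fn, reduce_fn):
    # Shuffle: group intermediate values by key
    groups = defaultdict(list)
    for key, value in inputs:
        for k2, v2 in map_fn(key, value):
            groups[k2].append(v2)
    # Reduce each key's list of values
    output = {}
    for k2, values in sorted(groups.items()):
        for k3, v3 in reduce_fn(k2, values):
            output[k3] = v3
    return output

docs = [("doc1", "the quick brown fox"), ("doc2", "the lazy dog")]
result = run_mapreduce(docs, map_fn, reduce_fn)
print(result)  # "the" appears twice, every other word once
```

A real framework does the same grouping, but spread across many machines and disks.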
Types
User-supplied map and reduce functions have associated types
1 Map
map(k1, v1) → list(k2, v2)
2 Reduce
reduce(k2, list(v2)) → list(v2)
More applications
Distributed Grep
1 Map
Emits a line if it matches a user-provided pattern
2 Reduce
Identity function
Count of URL Access Frequency
1 Map
Similar to the Word Count map, but with URLs instead of words
2 Reduce
Similar to the Word Count reduce
More applications (2)
Inverted Index
1 Map
Emits a sequence of <word, document_ID> pairs
2 Reduce
Emits <word, list(document_ID)>
Distributed Sort
1 Map
Identity
2 Reduce
Identity
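The inverted-index application can be sketched in a few lines; the single-process version below (names are illustrative, not from any real library) folds the map, shuffle, and reduce steps together:

```python
from collections import defaultdict

def inverted_index(docs):
    # docs: list of (document_ID, text) pairs
    index = defaultdict(set)
    # Map phase: emit (word, document_ID) for every word;
    # the dict keyed by word plays the role of the shuffle
    for doc_id, text in docs:
        for word in text.split():
            index[word].add(doc_id)
    # Reduce phase: emit (word, sorted list of document_IDs)
    return {word: sorted(ids) for word, ids in index.items()}

docs = [(1, "map reduce"), (2, "map only")]
index = inverted_index(docs)
print(index)  # "map" maps to [1, 2]
```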
Outline
1 Introduction
2 Programming Model
3 Implementation
4 Refinements
5 Hadoop
Cluster architecture
A large cluster of shared-nothing commodity machines connected via Ethernet
Each node is an x86 system running Linux with local memory
Commodity networking hardware connected in a tree topology
As clusters consist of hundreds or thousands of machines, failure is common
Each machine has local hard drives
The Google File System (GFS) runs atop these disks and employs replication to ensure availability and reliability
Jobs are submitted to a scheduler, which maps tasks within each job to available machines in the cluster
MapReduce architecture
1 Master: in charge of all metadata, work scheduling and distribution, and job orchestration
2 Workers: contain slots to execute map or reduce functions
Execution
1 The user writes map and reduce functions and stitches together a MapReduce specification with the location of the input dataset, the number of reduce tasks, and other attributes
2 The master logically splits the input dataset into M splits, where
M = input_dataset_size / GFS_block_size
The GFS block size is typically a multiple of 64MB
3 It then earmarks M map tasks and assigns them to workers. Each worker has a configurable number of task slots. Each time a worker completes a task, the master assigns it more pending map tasks
4 Once all map tasks have completed, the master assigns R reduce tasks to worker nodes
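The split arithmetic in step 2 is just a ceiling division, since the last split may be smaller than a full block. A quick sketch (the function name is made up for illustration):

```python
import math

def num_map_tasks(input_size_bytes, block_size_bytes=64 * 1024 * 1024):
    # One map task per input split; the last split may be partial,
    # hence the ceiling division
    return math.ceil(input_size_bytes / block_size_bytes)

# A 10 GB dataset with 64 MB splits yields 160 map tasks
print(num_map_tasks(10 * 1024**3))
```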
Mappers
1 A map worker reads the contents of the input split that it has been assigned
2 It parses the file, converts it to key/value pairs, and invokes the user-defined map function for each pair
3 The intermediate key/value pairs produced by the map logic are buffered in memory
4 Once the buffered key/value pairs exceed a threshold, they are partitioned (using a partitioning function) into R partitions and written to local disk. The locations of the partitions are passed to the master
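Step 4's spill-and-partition can be sketched as follows. This is a toy in-memory version under invented names (`partition`, `spill`); a real map worker writes each partition to a local file:

```python
from collections import defaultdict

def partition(key, R):
    # Default partitioning function: hash(key) mod R
    return hash(key) % R

def spill(buffered_pairs, R):
    # Split the in-memory buffer into R partitions,
    # one per reduce task
    partitions = defaultdict(list)
    for key, value in buffered_pairs:
        partitions[partition(key, R)].append((key, value))
    return partitions

pairs = [("apple", 1), ("banana", 1), ("apple", 1)]
partitions = spill(pairs, R=4)
# All pairs with the same key land in the same partition,
# so one reduce task sees every value for that key
```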
Reducers
1 A reduce worker gets the locations of its input partitions from the master and uses HTTP requests to retrieve them
2 Once it has read all of its input, it sorts it by key to group together all occurrences of the same key
3 It then invokes the user-defined reduce function for each key, passing it the key and its associated values
4 The key/value pairs generated by the reduce logic are written to a final output file, which is subsequently written to the distributed filesystem
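Steps 2 and 3 amount to sort-then-group: sorting makes equal keys adjacent, so one linear pass can feed each group to the user's reduce function. A minimal sketch (the name `reduce_phase` is invented here):

```python
from itertools import groupby
from operator import itemgetter

def reduce_phase(fetched_pairs, reduce_fn):
    # Sort by key so equal keys become adjacent,
    # then group and apply the user-defined reduce
    fetched_pairs.sort(key=itemgetter(0))
    output = []
    for key, group in groupby(fetched_pairs, key=itemgetter(0)):
        values = [v for _, v in group]
        output.append((key, reduce_fn(key, values)))
    return output

# Word-count style reduce: sum the grouped counts
pairs = [("b", 1), ("a", 1), ("b", 1)]
result = reduce_phase(pairs, lambda k, vs: sum(vs))
print(result)  # [("a", 1), ("b", 2)]
```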
Book-keeping by the Master
The master contains metadata for all jobs running in the cluster
For each map and reduce task, it stores the state (pending, in-progress, or completed) and the ID of the worker on which it is executing (for in-progress tasks)
It stores the locations and sizes of the partitions produced by each map task
Fault-tolerance
For large compute clusters, failures are the norm rather than the exception
1 Worker:
Each worker sends a periodic heartbeat signal to the master
If the master does not receive a heartbeat from a worker within a certain amount of time, it marks the worker as failed
In-progress map and reduce tasks are simply re-executed on other nodes. The same goes for completed map tasks, as their output is lost on machine failure
Completed reduce tasks are not re-executed, as their output resides on the distributed filesystem
2 Master:
The entire computation is marked as failed
But it is simple to periodically checkpoint the master's state and re-spawn it from the last checkpoint
Locality
Network bandwidth is a scarce resource in typical clusters
GFS slices files into 64MB blocks and stores 3 replicas of each across the cluster
The master exploits this information by scheduling a map task near its input data. Preference is, in order: node-local, rack/switch-local, and any
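That three-level preference can be sketched as a ranking function. The types and names below (`Task`, `Worker`, `replica_hosts`, `replica_racks`) are hypothetical stand-ins for the master's replica metadata, not a real scheduler API:

```python
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    replica_hosts: set   # hosts holding a replica of this task's input block
    replica_racks: set   # racks containing those hosts

@dataclass
class Worker:
    host: str
    rack: str

def locality_rank(task, worker):
    # 0 = node-local, 1 = rack-local, 2 = remote (any)
    if worker.host in task.replica_hosts:
        return 0
    if worker.rack in task.replica_racks:
        return 1
    return 2

def pick_task(pending, worker):
    # Choose the pending task with the best (lowest) locality rank
    return min(pending, key=lambda t: locality_rank(t, worker), default=None)

w = Worker(host="h1", rack="r1")
tasks = [
    Task("remote", {"h9"}, {"r9"}),
    Task("rack",   {"h2"}, {"r1"}),
    Task("local",  {"h1"}, {"r1"}),
]
print(pick_task(tasks, w).name)  # local
```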
Speculative re-execution
Every now and then the entire computation is held up by a "straggler" task
Stragglers can arise for a number of reasons, such as machine load, network traffic, and software/hardware bugs
To deal with stragglers, the master speculatively re-executes slow tasks on other machines
A task is marked as completed whenever either the primary or the backup finishes its execution
Scalability
Possible to run at multiple scales: from single nodes to data centers with tens of thousands of nodes
Nodes can be added/removed on the fly to scale up/down
Outline
1 Introduction
2 Programming Model
3 Implementation
4 Refinements
5 Hadoop
Partitioning
By default, MapReduce uses hash partitioning to partition the key space:
hash(key) % R
Optionally, the user can provide a custom partitioning function to, say, counteract skew or to ensure that certain keys always end up at a particular reduce worker
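A concrete custom partitioner, adapted from the example in the MapReduce paper: when keys are URLs, hashing only the hostname sends all URLs from the same host to the same reduce worker. The function names here are illustrative:

```python
from urllib.parse import urlparse

def default_partition(key, R):
    # Default: hash the whole key
    return hash(key) % R

def host_partition(url_key, R):
    # Custom partitioner: hash only the hostname, so every URL from
    # the same host lands on the same reduce worker
    return hash(urlparse(url_key).hostname) % R

R = 8
a = host_partition("http://example.com/a", R)
b = host_partition("http://example.com/b", R)
# a == b: same host, same partition, regardless of path
```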
Combiner function
For reduce functions that are commutative and associative, the user can additionally provide a combiner function, which is applied to the output of the map for local merging
Typically, the same reduce function is used as the combiner
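For word count, the combiner is just the reduce logic run locally on the map worker's output, which shrinks the data that must cross the network. A toy sketch (the `combine` name is invented for illustration):

```python
from collections import defaultdict

def combine(map_output):
    # Local merge on the map worker: sum counts per word before the
    # pairs are sent over the network (word count's reduce doubles
    # as its combiner because addition is commutative and associative)
    merged = defaultdict(int)
    for word, count in map_output:
        merged[word] += count
    return list(merged.items())

map_output = [("the", 1), ("the", 1), ("fox", 1), ("the", 1)]
combined = combine(map_output)
# 4 intermediate pairs shrink to 2; the final reduce still
# computes the same totals
```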
Input/output formats
By default, the library supports a number of input/output formats
For instance, text as input and key/value pairs as output
Optionally, the user can specify custom input readers and output writers
For instance, to read/write from/to a database
Outline
1 Introduction
2 Programming Model
3 Implementation
4 Refinements
5 Hadoop
Hadoop
Open-source implementation of MapReduce, developed by Doug Cutting, originally at Yahoo!, starting in 2004
Now a top-level Apache open-source project
Implemented in Java (Google's in-house implementation is in C++)
Comes with an associated distributed filesystem, HDFS (a clone of GFS)
References
Jeffrey Dean and Sanjay Ghemawat. 2004. MapReduce: Simplified Data Processing on Large Clusters. In Proceedings of the 6th Symposium on Operating Systems Design and Implementation (OSDI '04), Vol. 6. USENIX Association, Berkeley, CA, USA.