1. VanPyz, June 2, 2009
Introduction to MapReduce
using Disco
Erlang and Python
by @JimRoepcke
2. Computing at Google Scale
Image Source: http://ischool.tv/news/files/2006/12/computer-grid02s.jpg
Massive databases and data
streams need to be processed
quickly and reliably
Thousands of commodity PCs
available in Google’s cluster
for computations
Faults are statistically
“guaranteed” to occur
3. Google’s Motivation
Google has thousands of programs to process user-generated data
Even simple computations were being obscured by the
complex code required to run efficiently and reliably on
their clusters.
Engineers shouldn’t have to be experts in distributed
systems to write scalable data-processing software.
4. Why not just use threads?
Threads add concurrency, but only within a single node
They do not scale beyond one node to a cluster or a cloud
Coordinating work between nodes requires distribution
middleware
MapReduce is distribution middleware
MapReduce scales linearly with cores / nodes
6. Disco
Created by Ville Tuulos of the Nokia Research Center
Written in Erlang and Python
Does not include a distributed file system
You provide your own data-distribution mechanism
11. Master splits input
The (typically huge) input is split into chunks
One or more splits for each “map worker”
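The splitting step above can be sketched in plain Python. This is an illustration only: the function name, the fixed chunk size, and the toy record list are mine, not Disco's.

```python
def split_input(records, chunk_size):
    # Partition the input records into fixed-size splits,
    # one or more of which will be handed to each map worker.
    return [records[i:i + chunk_size]
            for i in range(0, len(records), chunk_size)]

# Ten records, splits of four: two full splits plus a remainder.
splits = split_input(list(range(10)), 4)
```

A real master would split by bytes or files rather than record counts, but the idea is the same: many independent splits, each processable in isolation.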
12. Splits fed to map workers
The master tells each map worker which split(s) it will
process
A split is a file containing some number of input
records
Each record has a key and its associated value
13. Map each input
The map worker executes your problem-specific map
algorithm
Called once for each record in its input
14. Map emits (Key,Value) pairs
Your map algorithm emits zero or more intermediate
key-value pairs for each record processed
Let’s call these “(K,V) pairs” from now on
Keys and values are both strings
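The map step just described can be sketched as an ordinary function: called once per record, emitting zero or more (K,V) pairs. The function name and sample record below are illustrative (this is the word-count map used later in the talk).

```python
def map_words(key, value):
    # key: the record's key (ignored here)
    # value: a line of text
    # Emit one ("word", "1") intermediate pair per word.
    return [(word, "1") for word in value.split()]

pairs = map_words("doc1", "to be or not to be")
```

Note that both keys and values are strings; the count "1" is parsed back to an integer in the reduce step.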
15. (K,V) Pairs hashed to buckets
Each map worker has its own set of buckets
Each (K,V) pair is placed into one of these buckets
Which bucket is determined by a hash function
Advanced: if you know the distribution of your
intermediate keys is skewed, provide a custom hash
function that distributes (K,V) pairs evenly
16. Buckets sent to Reducers
Once all map workers are finished, corresponding
buckets of (K,V) pairs are sent to reduce workers
Example: Each map worker placed (K,V) pairs into its
own buckets A, B, and C.
Send bucket A from each map to reduce worker 1;
Send bucket B from each map to reduce worker 2;
Send bucket C from each map to reduce worker 3.
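The shuffle in that example can be sketched as list manipulation. The data and function name below are made up; buckets A, B, C are modeled as indexes 0, 1, 2.

```python
# Each inner list is one map worker's buckets A, B, C.
map_outputs = [
    [[("a", "1")], [("b", "1")], [("c", "1")]],  # map worker 1
    [[("a", "1")], [],           [("c", "1")]],  # map worker 2
]

def shuffle(map_outputs, num_reducers):
    # Reduce worker i receives the concatenation of bucket i
    # from every map worker.
    return [[pair for buckets in map_outputs for pair in buckets[i]]
            for i in range(num_reducers)]

reduce_inputs = shuffle(map_outputs, 3)
```

Reduce worker 1 ends up with every "a" pair, worker 2 with every "b" pair, and so on, regardless of which map worker produced them.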
17. Reduce inputs sorted
The reduce worker first concatenates the buckets it
received into one file
Then the file of (K,V) pairs is sorted by K
Now the (K,V) pairs are grouped by key
This sorted list of (K,V) pairs is the input to the reduce
worker
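The concatenate-sort-group step maps directly onto Python's `sorted` and `itertools.groupby` (the sample pairs are illustrative):

```python
from itertools import groupby

# Concatenated (K,V) pairs received from the map workers.
pairs = [("the", "1"), ("cat", "1"), ("the", "1")]

# Sort by key so equal keys become adjacent, then group.
pairs.sort(key=lambda kv: kv[0])
grouped = [(k, [v for _, v in group])
           for k, group in groupby(pairs, key=lambda kv: kv[0])]
```

`groupby` only merges adjacent equal keys, which is exactly why the sort must come first.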
18. Reduce the list of (K,V) pairs
The reduce worker executes your problem-specific
reduce algorithm
Called once for each key in its input
Writes whatever it wants to its output file
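For word counting, the problem-specific reduce amounts to summing the string counts for one key. A sketch (the function name is mine; a real reducer would write to its output file rather than return):

```python
def reduce_counts(key, values):
    # key: a word; values: the list of string counts for that word.
    # Called once per key in the sorted, grouped input.
    return key, sum(int(v) for v in values)

word, total = reduce_counts("the", ["1", "1", "1"])
```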
19. Output
The output of the MapReduce job is the set of output
files generated by the reduce workers
What you do with this output is up to you
You might use this output as the input to another
MapReduce job
20. Modified from source: http://labs.google.com/papers/mapreduce-osdi04.pdf
Example: Counting words
def map(key, value):
    # key: document name (ignored)
    # value: words in document (list)
    for word in value:
        EmitIntermediate(word, "1")

def reduce(key, values):
    # key: a word
    # values: a list of counts
    result = 0
    for v in values:
        result += int(v)
    print key, result
21. Stand up! Let’s do it!
Organize yourselves into approximately equal numbers
of map and reduce workers
I’ll be the master
22. Disco demonstration
Wanted to demonstrate a cool
puzzle solver.
No go, but I can show the code.
It’s really simple!
Instead you get count_words again,
but scaled way up!
python count_words.py disco://localhost