In this session you will learn:
1. Meet MapReduce
2. Word Count Algorithm – Traditional approach
3. Traditional approach on a Distributed System
4. Traditional approach – Drawbacks
5. MapReduce Approach
6. Input & Output Forms of a MR program
7. Map, Shuffle & Sort, Reduce Phase
8. WordCount Code walkthrough
9. Workflow & Transformation of Data
10. Input Split & HDFS Block
11. Relation between Split & Block
12. Data locality Optimization
13. Speculative Execution
14. MR Flow with Single Reduce Task
15. MR flow with multiple Reducers
16. Input Format & Hierarchy
17. Output Format & Hierarchy
In pioneer days they used oxen for heavy pulling, and when one ox couldn’t budge a log, they didn’t try to grow a larger ox. We shouldn’t be trying for bigger computers, but for more systems of computers. – Grace Hopper
Meet MapReduce
• MapReduce is a programming model for distributed processing
• Advantage - easy scaling of data processing over multiple computing nodes
• The basic entities in this model are – mappers & reducers
• Decomposing a data processing application into mappers and reducers is the developer’s task
• Once an application is written in the MapReduce form, scaling it to run over hundreds, thousands, or even tens of thousands of machines in a cluster is merely a configuration change
WordCount – Traditional Approach
• The program loops through all the documents. For each document, the
words are extracted one by one using a tokenization process. For each
word, its corresponding entry in a multiset called wordCount is
incremented by one. At the end, a display() function prints out all the
entries in wordCount.
• A multiset is a set where each element also has a count. The word count
we’re trying to generate is a canonical example of a multiset. In practice, it’s
usually implemented as a hash table.
define wordCount as Multiset;
for each document in documentSet {
    T = tokenize(document);
    for each token in T {
        wordCount[token]++;
    }
}
display(wordCount);
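A minimal Java sketch of this traditional, single-machine approach, using a HashMap as the multiset. The class name, tokenization rule, and the idea of passing documents as file paths are illustrative assumptions, not from the slides:

import java.nio.file.*;
import java.util.*;

public class SequentialWordCount {
    public static void main(String[] args) throws Exception {
        // The multiset: word -> count, implemented as a hash table
        Map<String, Long> wordCount = new HashMap<>();
        for (String arg : args) {                                          // each argument is one document (a file path)
            for (String line : Files.readAllLines(Paths.get(arg))) {
                for (String token : line.toLowerCase().split("\\W+")) {   // crude tokenization
                    if (!token.isEmpty()) {
                        wordCount.merge(token, 1L, Long::sum);            // increment the word's entry
                    }
                }
            }
        }
        wordCount.forEach((word, count) -> System.out.println(word + "\t" + count));  // display()
    }
}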
Traditional Approach – Distributed Processing
Phase 1 – runs on each machine over its own subset of documents:
define wordCount as Multiset;
for each document in documentSubset {
    <same code as in prev. slide>
}
sendToSecondPhase(wordCount);

Phase 2 – aggregates the partial counts:
define totalWordCount as Multiset;
for each wordCount received from firstPhase {
    multisetAdd(totalWordCount, wordCount);
}
Traditional Approach – Drawbacks
• Central storage – the bandwidth of a single server becomes the bottleneck
• Multiple storage locations – splitting the documents across machines has to be handled
• The program keeps its counts in memory – when processing large document sets, the number of unique words can exceed the RAM of a single machine
• Can phase 2 be handled by one machine?
• If multiple machines are used for phase 2, how do we partition the data among them?
MapReduce Approach
• Has two execution phases – mapping & reducing
• These phases are carried out by data processing functions called the mapper & the reducer
• Mapping phase – MR takes the input data and feeds each data element to the mapper
• Reducing phase – the reducer processes all the outputs from the mapper and arrives at a final result
Input & Output forms:
• In order for mapping, reducing, partitioning, and shuffling (and a few others that were not mentioned) to seamlessly work together, we need to agree on a common structure for the data being processed
• The InputFormat class is responsible for creating input splits and dividing them into records

             Input              Output
map()        <k1, v1>           list(<k2, v2>)
reduce()     <k2, list(v2)>     list(<k3, v3>)
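A minimal sketch of how these forms appear as generic type parameters in the Hadoop Java API. The concrete types chosen below (LongWritable byte offsets, Text lines) are what a typical text-processing job would use; the class names MyMapper/MyReducer are illustrative:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper<K1, V1, K2, V2>: consumes <k1, v1>, emits list(<k2, v2>)
class MyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    // map(k1, v1) calls context.write(k2, v2), possibly many times per input record
}

// Reducer<K2, V2, K3, V3>: consumes <k2, list(v2)>, emits list(<k3, v3>)
class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    // reduce(k2, Iterable<v2>) calls context.write(k3, v3)
}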
• Input & output forms should be flexible and powerful enough to handle most of the targeted data processing applications. MapReduce uses lists and (key/value) pairs as its main data primitives.
• The keys and values are often integers or strings but can also be dummy values to be ignored, or complex object types.
MR – Workflow & Transformation of Data
Data is transformed in four stages:
• From the input files to the mapper
• From the mapper to the intermediate results
• From the intermediate results to the reducer
• From the reducer to the output files
Word Count: Source Code
• Key points to note:
  1. In MR, map() processes one record at a time, whereas traditional approaches process one document at a time.
  2. The new classes we have seen (Text, IntWritable, LongWritable, etc.) have additional serialization capabilities. (Will discuss in detail later)
• Source Code: http://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html
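A condensed sketch of the mapper and reducer, following the shape of the WordCount example in the tutorial linked above (abbreviated here; see the link for the full, authoritative listing):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// map() is called once per input record (one line of text), not once per document
class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);          // emit <word, 1>
        }
    }
}

// reduce() receives <word, list(1, 1, ...)> and sums the occurrences
class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);            // emit <word, total count>
    }
}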
Relation Between Input Split & HDFS Block
[Diagram: a file’s logical records (lines 1–10) laid out across HDFS block boundaries, with three input splits spanning the blocks]
• Logical records do not fit neatly into HDFS blocks.
• Logical records are lines, and a line can cross a block boundary.
• In the diagram, the first split contains line 5 even though that line spans two blocks.
Data locality Optimization
• An MR job is split into various map & reduce tasks
• Map tasks run on the input splits
• Ideally, the task JVM is started on the node where the split/block of data resides
• In some scenarios, however, that node may not be free to accept another task
• In that case, the task is scheduled on a TaskTracker at a different location
• Scenario a) Same-node execution
• Scenario b) Off-node execution (another node in the same rack)
• Scenario c) Off-rack execution
Speculative execution
• An MR job is split into various map & reduce tasks, and they get executed in parallel.
• Overall job execution time is dominated by the slowest task.
• Hadoop doesn’t try to diagnose and fix slow-running tasks; instead, it tries to detect when a task is running slower than expected and launches another equivalent task as a backup. This is termed speculative execution of tasks.
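Speculative execution can be toggled per job. A minimal sketch, assuming the standard MRv2 property names mapreduce.map.speculative and mapreduce.reduce.speculative; the class name is illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SpeculationConfig {
    public static Job configure() throws Exception {
        Configuration conf = new Configuration();
        // Enable or disable backup (speculative) attempts for map and reduce tasks
        conf.setBoolean("mapreduce.map.speculative", true);
        conf.setBoolean("mapreduce.reduce.speculative", false);
        return Job.getInstance(conf, "example job");
    }
}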
Combiner
• A combiner is a mini-reducer
• It gets executed on the mapper output, at the mapper side
• The combiner’s output is fed to the reducer
• Because the mapper output is pre-aggregated by the combiner, the data that has to be shuffled across the cluster is minimized
• Because the combiner function is an optimization, Hadoop does not guarantee how many times it will call it for a particular map output record, if at all
• So, calling the combiner function zero, one, or many times should produce the same output from the reducer (see the driver sketch below for how a combiner is attached to a job)
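A minimal sketch of wiring a combiner into the WordCount driver. For summing counts, the reducer class itself can serve as the combiner; the driver follows the shape of the standard Hadoop tutorial and reuses the TokenizerMapper/IntSumReducer classes sketched earlier:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // combiner: sums partial counts on the map side
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}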
Combiner’s Contract
• Only functions that are commutative & associative can safely be used as combiners.
• Because
  max(0, 20, 10, 25, 15) = max(max(0, 20, 10), max(25, 15)) = max(20, 25) = 25
  whereas
  mean(0, 20, 10, 25, 15) = 14, but
  mean(mean(0, 20, 10), mean(25, 15)) = mean(10, 20) = 15
• Can a combiner replace a reducer?
Partitioner
• We know that a unique key will always go to a unique reducer.
• The partitioner is responsible for sending (key, value) pairs to a reducer based on the key content.
• The default partitioner is the HashPartitioner. It takes the mapper output, computes a hash value for each key, and takes that value modulo the number of reducers; the result determines which reducer a particular key goes to.
public class HashPartitioner<K, V> extends Partitioner<K, V> {
    public int getPartition(K key, V value, int numReduceTasks) {
        // Mask off the sign bit so the partition number is never negative
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
Example partition assignments (hash value % number of reducers):
With 3 reducers:  2%3=2, 3%3=0, 4%3=1, 5%3=2, 6%3=0
With 4 reducers:  2%4=2, 3%4=1, 4%4=0, 5%4=1, 6%4=2, 7%4=3
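A minimal sketch of a custom partitioner. The FirstLetterPartitioner name and its routing rule (partition by the key’s first character) are illustrative assumptions, not from the slides:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Illustrative partitioner: route keys by their first character instead of the default hash
public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        String k = key.toString();
        int firstChar = k.isEmpty() ? 0 : k.charAt(0);   // chars are non-negative in Java
        return firstChar % numReduceTasks;               // partition number in [0, numReduceTasks)
    }
}

In the driver, it would be attached with job.setPartitionerClass(FirstLetterPartitioner.class), and the number of reducers set with job.setNumReduceTasks(3).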
InputFormat Hierarchy
• An Input split is a chunk of the input that is processed by a single map. Each
map processes a single split. Each split is divided into records, and the map
processes each record—a key-value pair—in turn. Splits and records are
logical: there is nothing that requires them to be tied to files, for example,
although in their most common incarnations, they are.
• In a database context, a split might correspond to a range of rows from a
table and a record to a row in that range.
• An InputFormat is responsible for creating the input splits and dividing
them into records.
public abstract class InputFormat<K, V> {
    public abstract List<InputSplit> getSplits(JobContext context)
            throws IOException, InterruptedException;

    public abstract RecordReader<K, V> createRecordReader(InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException;
}
• The client calls getSplits(); each map task calls createRecordReader() for its split
• FileInputFormat is the base class for all implementations of InputFormat that use files as their data source
• It provides two things: a place to define which files are included as the input to a job, and an implementation for generating splits for the input files. The job of dividing splits into records is performed by subclasses.
public static void addInputPath(Job job, Path path)
public static void addInputPaths(Job job, String commaSeparatedPaths)
public static void setInputPaths(Job job, Path... inputPaths)
public static void setInputPaths(Job job, String commaSeparatedPaths)
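A minimal sketch of using these methods in a driver: addInputPath/addInputPaths append to the job’s input list, while setInputPaths replaces it. The paths and class name below are illustrative:

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class InputPathSetup {
    public static void configure(Job job) throws IOException {
        job.setInputFormatClass(TextInputFormat.class);               // line-oriented text input
        FileInputFormat.addInputPath(job, new Path("/data/2023"));    // append a single path
        FileInputFormat.addInputPaths(job, "/data/2024,/data/2025");  // append several, comma-separated
        // FileInputFormat.setInputPaths(job, new Path("/data"));     // or replace the whole list
    }
}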
Counters
• Counters are a useful channel for gathering statistics about the job: for quality control or for application-level statistics.
• Often used for debugging purposes, e.g. counting the number of good records and bad records in the input.
• Two types – built-in & custom counters (a custom-counter sketch follows below).
• Examples of built-in counters:
  – Map input records
  – Map output records
  – Filesystem bytes read
  – Launched map tasks
  – Failed map tasks
  – Killed reduce tasks
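A minimal sketch of a custom (user-defined) counter incremented from inside a mapper. The enum name and the "bad record" validity rule are illustrative assumptions:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class RecordQualityMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    // Custom counters are usually declared as an enum; Hadoop groups them by the enum's class name
    public enum RecordQuality { GOOD, BAD }

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split(",");
        if (fields.length < 3) {                                   // illustrative validity rule
            context.getCounter(RecordQuality.BAD).increment(1);    // count and skip bad records
            return;
        }
        context.getCounter(RecordQuality.GOOD).increment(1);
        context.write(new Text(fields[0]), new IntWritable(1));
    }
}

The counter totals are aggregated across all tasks and reported with the job's final status.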
Joins
• Map-side join (replicated join): a map-side join that works in situations where one of the datasets is small enough to cache in memory
• Reduce-side join (repartition join): a reduce-side join for situations where you’re joining two or more large datasets together
• Semi-join: another map-side join, where one dataset is initially too large to fit into memory but, after some filtering, can be reduced down to a size that fits in memory
Distributed Cache
• Side data can be defined as extra read-only data needed by a job to process
the main dataset
• To make side data available to all map or reduce tasks, we distribute those
datasets using Hadoop’s Distributed Cache mechanism.
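A minimal sketch of distributing a small side-data file and reading it in a mapper’s setup(), using the Job.addCacheFile API. The file name, the "#countries.txt" symlink fragment, and the lookup-table use are illustrative assumptions:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class SideDataExample {

    public static class LookupMapper extends Mapper<LongWritable, Text, Text, Text> {
        private final Map<String, String> lookup = new HashMap<>();

        @Override
        protected void setup(Context context) throws IOException {
            // Cached files are localized on each task node; the URI fragment below names the local symlink
            try (BufferedReader reader = new BufferedReader(new FileReader("countries.txt"))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    String[] parts = line.split("\t");          // e.g. "IN<TAB>India"
                    if (parts.length == 2) {
                        lookup.put(parts[0], parts[1]);
                    }
                }
            }
        }
        // map() can now enrich each record using the in-memory lookup table
    }

    public static void configureCache(Job job) {
        // Ship the side-data file to every task node before the tasks start
        job.addCacheFile(URI.create("hdfs:///side-data/countries.txt#countries.txt"));
    }
}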