Learning Objectives - In this module, you will learn how the Hadoop MapReduce framework works on data stored in HDFS, and explore the different types of input and output formats in the MapReduce framework and how to use them.
2. Hadoop Data Types (http://hadoop.apache.org/docs/current/api/index.html)
org.apache.hadoop.io
• int -> IntWritable, long -> LongWritable, boolean -> BooleanWritable, float -> FloatWritable, byte -> ByteWritable
We can use the following built-in data types as keys and values:
• Text: stores UTF-8 text
• BytesWritable: stores a sequence of bytes
• VIntWritable and VLongWritable: store variable-length integer and long values
• NullWritable: a zero-length Writable type that can be used when you don't need a key or value
• The key class should implement the WritableComparable interface.
• The value class should implement the Writable interface.
E.g.
public class IntWritable implements WritableComparable
public abstract interface WritableComparable<T> extends Writable, Comparable<T>
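The Writable/WritableComparable contract above can be sketched in plain Java without a Hadoop dependency: write(DataOutput) serializes the fields, readFields(DataInput) restores them into a reusable object, and compareTo() makes the type usable as a key. The class name below is hypothetical, and it mirrors (rather than implements) the real Hadoop interfaces, which would require hadoop-common on the classpath.

```java
import java.io.*;

// Hypothetical custom key type mirroring the WritableComparable contract.
public class IntPairKey implements Comparable<IntPairKey> {
    private int first;
    private int second;

    public IntPairKey() {}   // no-arg constructor, required for deserialization
    public IntPairKey(int first, int second) { this.first = first; this.second = second; }

    // Mirrors Writable.write: serialize the fields in a fixed order.
    public void write(DataOutput out) throws IOException {
        out.writeInt(first);
        out.writeInt(second);
    }

    // Mirrors Writable.readFields: restore the fields from the stream.
    public void readFields(DataInput in) throws IOException {
        first = in.readInt();
        second = in.readInt();
    }

    // Mirrors WritableComparable: defines the sort order used for keys.
    @Override
    public int compareTo(IntPairKey o) {
        int cmp = Integer.compare(first, o.first);
        return cmp != 0 ? cmp : Integer.compare(second, o.second);
    }

    public int getFirst() { return first; }
    public int getSecond() { return second; }

    // Serialize then deserialize, as the framework does between map and reduce.
    public static IntPairKey roundTrip(IntPairKey k) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            k.write(new DataOutputStream(bos));
            IntPairKey copy = new IntPairKey();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(bos.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        IntPairKey copy = roundTrip(new IntPairKey(3, 7));
        System.out.println(copy.getFirst() + "," + copy.getSecond()); // 3,7
    }
}
```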
4. MapReduce paradigm
• Splits input files into blocks (typically 64 MB each under the classic HDFS default; 128 MB in later versions)
• Operates on key/value pairs
• Mappers filter & transform input data
• Reducers aggregate mappers output
• Efficient way to process data across the cluster:
• Move code to data
• Run code on all machines
• Divide & conquer: partition a large problem into smaller sub-problems
• Independent sub-problems can be executed in parallel by workers (anything from threads to clusters)
• Intermediate results from each worker are combined to get the final result
5. MapReduce paradigm contd..
• Challenges:
• How to transform a problem into sub-problems?
• How to assign workers and synchronize the intermediate results?
• How do the workers get the required data?
• How to handle failures in the cluster?
9. Combiners
• Combiner: local aggregation of key/value pairs after map() and before the shuffle & sort phase (runs on the same machine as map())
• Also called a "mini-reducer"
• Instead of emitting (the, 1) 100 times, the combiner emits (the, 100)
• Can lead to great speed-ups and saves network bandwidth
• Each combiner operates in isolation and has no access to other mappers' key/value pairs
• A combiner cannot be assumed to process all values associated with the same key (it may not run at all; the decision is Hadoop's)
• Emitted key/value pairs must be of the same types as those emitted by the mapper
10. Combiners contd..
• If the function computed is
• Commutative [a + b = b + a]
• Associative [a + (b + c) = (a + b) + c]
then the reducer can be reused as the combiner.
The max function works:
max(max(a, b), max(c, d, e)) = max(a, b, c, d, e)
The mean function does not work:
mean(mean(a, b), mean(c, d, e)) != mean(a, b, c, d, e)
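The mean case can be checked with concrete numbers, along with the standard fix: instead of emitting per-group means, the combiner emits partial (sum, count) pairs, and the reducer divides once at the end. A plain-Java sketch (no Hadoop needed; method names are illustrative):

```java
// Demonstrates why a mean reducer cannot be reused as a combiner,
// and the (sum, count) fix that makes local aggregation exact.
public class MeanCombinerDemo {
    // Naive mean over a list of values.
    public static double mean(double... xs) {
        double s = 0;
        for (double x : xs) s += x;
        return s / xs.length;
    }

    // Correct approach: combiners emit partial sums and counts;
    // the reducer adds the partials and divides once.
    public static double meanFromPartials(double sum1, long n1, double sum2, long n2) {
        return (sum1 + sum2) / (n1 + n2);
    }

    public static void main(String[] args) {
        // Mean of means is wrong when group sizes differ:
        double wrong = mean(mean(1, 2), mean(3, 4, 5)); // mean(1.5, 4.0) = 2.75
        double right = mean(1, 2, 3, 4, 5);             // 3.0
        System.out.println(wrong + " vs " + right);

        // (sum, count) partials recover the exact mean: (3 + 12) / (2 + 3) = 3.0
        System.out.println(meanFromPartials(1 + 2, 2, 3 + 4 + 5, 3));
    }
}
```

This is why averaging jobs typically make the mapper/combiner emit a pair type holding (sum, count) rather than a single value.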
11. MapReduce Programming: Word Count
WordCountDriver.java
public class WordCountDriver extends Configured implements Tool {
  public int run(String[] args) throws Exception {
    Configuration conf = new Configuration();
    GenericOptionsParser parser = new GenericOptionsParser(conf, args);
    args = parser.getRemainingArgs();
    Job job = new Job(conf, "wordcount");
    job.setJarByClass(WordCountDriver.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    FileInputFormat.setInputPaths(job, new Path("E:\\aa\\input\\names.txt"));
    FileOutputFormat.setOutputPath(job, new Path("E:\\aa\\output"));
    job.setMapperClass(WordCountMapper.class);
    job.setReducerClass(WordCountReducer.class);
    // waitForCompletion returns true on success; by convention the exit code is 0 on success
    return job.waitForCompletion(true) ? 0 : 1;
  }
}
12. MapReduce Programming: Word Count
WordCountMapper.java
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private Text word = new Text();
  private final static IntWritable one = new IntWritable(1);
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
      word.set(tokenizer.nextToken());
      context.write(word, one);
    }
  }
}
13. MapReduce Programming: Word Count
WordCountReducer.java
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable value : values) {
      sum += value.get();
    }
    context.write(key, new IntWritable(sum));
  }
}
14. A minimal MapReduce driver
public class MinimalMapReduceWithDefaults extends Configured implements Tool {
  public int run(String[] args) throws Exception {
    Job job = new Job(getConf());
    job.setInputFormatClass(TextInputFormat.class);
    job.setMapperClass(Mapper.class);
    job.setMapOutputKeyClass(LongWritable.class);
    job.setMapOutputValueClass(Text.class);
    job.setPartitionerClass(HashPartitioner.class);
    job.setNumReduceTasks(1);
    job.setReducerClass(Reducer.class);
    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(Text.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    return job.waitForCompletion(true) ? 0 : 1;
  }
  public static void main(String[] args) throws Exception {
    int exitCode = ToolRunner.run(new MinimalMapReduceWithDefaults(), args);
    System.exit(exitCode);
  }
}
15. Input Splits and Records
• Input split is a chunk of the input that is processed by a single map.
• Each map processes a single split.
• Each split is divided into records, and the map processes each record (a key/value pair) in turn.
public abstract class InputSplit {
public abstract long getLength() throws IOException, InterruptedException;
public abstract String[] getLocations() throws IOException, InterruptedException;
}
16. InputFormat
• An InputFormat is responsible for creating the input splits and
dividing them into records.
public abstract class InputFormat<K, V> {
public abstract List<InputSplit> getSplits(JobContext context) throws
IOException, InterruptedException;
public abstract RecordReader<K, V> createRecordReader(InputSplit split,
TaskAttemptContext context) throws IOException, InterruptedException;
}
18. FileInputFormat
• A place to define which files are included as the input to a job.
• An implementation for generating splits for the input files.
FileInputFormat input paths
public static void addInputPath(Job job, Path path)
public static void setInputPaths(Job job, Path... inputPaths)
FileInputFormat input splits
max(minimumSize, min(maximumSize, blockSize))
by default: minimumSize < blockSize < maximumSize
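The split-size rule above can be expressed directly as code. A plain-Java sketch (the class and method names are illustrative, not Hadoop's):

```java
// Sketch of FileInputFormat's split-size rule:
// splitSize = max(minimumSize, min(maximumSize, blockSize)).
// With the defaults (minimumSize < blockSize < maximumSize) the split size
// equals the block size, so one split corresponds to one HDFS block.
public class SplitSizeDemo {
    public static long computeSplitSize(long minSize, long maxSize, long blockSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long block = 128L * 1024 * 1024; // 128 MB block

        // Defaults (tiny minSize, huge maxSize): split size == block size.
        System.out.println(computeSplitSize(1, Long.MAX_VALUE, block) == block); // true

        // Raising minSize above the block size forces larger splits (256 MB here).
        System.out.println(computeSplitSize(256L * 1024 * 1024, Long.MAX_VALUE, block));

        // Lowering maxSize below the block size forces smaller splits (64 MB here).
        System.out.println(computeSplitSize(1, 64L * 1024 * 1024, block));
    }
}
```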
20. Text Input : TextInputFormat
• TextInputFormat is the default InputFormat.
• Each record is a line of input.
• The key, a LongWritable, is the byte offset within the file of the beginning of the
line.
• The value is the contents of the line, excluding any line terminators (newline,
carriage return), and is packaged as a Text object.
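How TextInputFormat derives its keys can be sketched in plain Java: each line's key is the byte offset of the line's first character within the file, and the value is the line without its terminator. (The real record reading is done by Hadoop's LineRecordReader; the class and method names below are illustrative.)

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;

// Sketch of TextInputFormat's record model: byte offset -> line contents.
public class LineOffsetDemo {
    public static LinkedHashMap<Long, String> recordsOf(String file) {
        LinkedHashMap<Long, String> records = new LinkedHashMap<>();
        long offset = 0;
        for (String line : file.split("\n")) {
            records.put(offset, line);                                 // key = start offset
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1; // +1 for the '\n'
        }
        return records;
    }

    public static void main(String[] args) {
        // "hello\n" is 6 bytes, so the second line starts at byte offset 6.
        System.out.println(recordsOf("hello\nhadoop\nworld\n"));
        // {0=hello, 6=hadoop, 13=world}
    }
}
```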
21. Binary Input: SequenceFileInputFormat
• Hadoop’s sequence file format stores sequences of binary key-value
pairs.
• Sequence files are well suited as a format for MapReduce data since they are splittable.
• Support compression as a part of the format.
24. Output Types
• Text Output
• The default output format, TextOutputFormat, writes records as lines of text.
• Binary Output
• SequenceFileOutputFormat writes sequence files for its output.
• Multiple Outputs
• MultipleOutputs allows you to write data to files whose names are derived
from the output keys and values, or in fact from an arbitrary string.