This reader contains an introduction to MapReduce jobs. It covers some important classes within the r0.20.2 version of Hadoop, the setup of an empty application, and a simple assignment that can be used to get familiar with the framework. It was created for the Hadoop Hackathon at SARA (http://www.sara.nl) on December 7th, 2010.
Note that the URLs used in this document might not persist.
Hadoop Hackathon Reader
SARA Hadoop Hackathon
December 2010
Table of Contents
An introduction to Java MapReduce jobs in Apache Hadoop
org.apache.hadoop.mapreduce.InputFormat
org.apache.hadoop.mapreduce.Mapper
org.apache.hadoop.io.SequenceFile and org.apache.hadoop.mapreduce.Partitioner
org.apache.hadoop.mapreduce.Reducer
org.apache.hadoop.mapreduce.OutputFormat
An empty Hadoop MapReduce job in Java
org.apache.hadoop.util.Tool
org.apache.hadoop.mapreduce.Mapper
org.apache.hadoop.mapreduce.Reducer
A simple try-out: top Wikipedia page views
The setup
Our Tool
Our Mapper
Our Reducer
An introduction to Java MapReduce jobs in Apache Hadoop
A MapReduce job written in Java typically consists of the following components:
1. An InputFormat
2. A Mapper
3. A SequenceFile and Partitioner
4. A Reducer
5. An OutputFormat
org.apache.hadoop.mapreduce.InputFormat
@See: http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapreduce/InputFormat.html
@See: http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapreduce/InputSplit.html
@See: http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapreduce/RecordReader.html
It is the InputFormat's responsibility to:
• Validate the input of the Job
• Split up the input into logical InputSplits, which will be assigned to each Mapper
• Provide an implementation of a RecordReader, which is used by a Mapper to read input
records from the logical InputSplit.
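The split computation itself can be sketched in plain Java. This is not the Hadoop code – the class and method names below are made up for illustration – but it shows how a FileInputFormat-style class might derive logical (offset, length) splits from a file length:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: derive logical (offset, length) splits from a file length,
// the way a FileInputFormat-like class might before assigning them to Mappers.
public class SplitSketch {
    public static List<long[]> computeSplits(long fileLength, long splitSize) {
        List<long[]> splits = new ArrayList<>();
        for (long offset = 0; offset < fileLength; offset += splitSize) {
            // The last split may be shorter than splitSize.
            long length = Math.min(splitSize, fileLength - offset);
            splits.add(new long[] { offset, length });
        }
        return splits;
    }
}
```

For a 250-byte file and a 100-byte split size this yields three splits: (0, 100), (100, 100) and (200, 50); each would be handed to its own Mapper.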
org.apache.hadoop.mapreduce.Mapper
@See: http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapreduce/Mapper.html
@See: http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapreduce/RecordReader.html
@See: http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapreduce/InputSplit.html
@See: http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapreduce/Mapper.Context.html
A Mapper implements application-specific logic. It reads a set of key / value pairs as input from a
RecordReader, and generates a set of key / value pairs as output.
A Mapper should override a function map:
public void map(KEYIN key, VALUEIN value, Mapper.Context context)
Every time a Mapper gets initialized – which happens once for each InputSplit – a function is
called to set up the object. You can optionally override this function and do your own setup:
public void setup(Mapper.Context context)
Similarly, you can override a cleanup function that is called when the Mapper object is destroyed:
public void cleanup(Mapper.Context context)
Output from a Mapper is collected from within Mapper.map(). The Context object, provided as
a parameter to the function, exposes a function that must be used for this task:
public void write(KEYOUT key, VALUEOUT value)
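To make the call order concrete, this lifecycle can be imitated in plain Java. The class below is only a simulation, not the Hadoop Mapper API: setup() runs once per InputSplit, map() once per input record, cleanup() once at the end, and a list stands in for Context.write():

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java simulation of the Mapper lifecycle: setup, map per record, cleanup.
public class MapperLifecycle {
    private final List<String> output = new ArrayList<>(); // stands in for Context.write()

    public void setup() { output.add("setup"); }

    // "Maps" a record: here we just uppercase the value, tagged with its key.
    public void map(long key, String value) { output.add(key + ":" + value.toUpperCase()); }

    public void cleanup() { output.add("cleanup"); }

    // Drives one "InputSplit" worth of records through the lifecycle.
    public List<String> run(List<String> records) {
        setup();
        long offset = 0;
        for (String record : records) {
            map(offset, record);
            offset += record.length() + 1; // TextInputFormat-style byte offsets
        }
        cleanup();
        return output;
    }
}
```

Running it over the records "foo" and "bar" produces ["setup", "0:FOO", "4:BAR", "cleanup"], showing that setup and cleanup bracket all map calls.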
org.apache.hadoop.io.SequenceFile and org.apache.hadoop.mapreduce.Partitioner
@See: http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/io/SequenceFile.html
@See: http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapreduce/Partitioner.html
@See: http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/io/SequenceFile.Sorter.html
Temporary outputs from the Mappers are stored in SequenceFiles. This is a binary representation of
key / value pairs. A SequenceFile object provides a:
• SequenceFile.Reader
• SequenceFile.Writer
• and a SequenceFile.Sorter.
If the job is configured to use more than one Reducer, then the sorted SequenceFile is partitioned
by a Partitioner, creating as many partitions as there are Reducers. The partitioning is done by executing
some function on each key in the SequenceFile, typically a hash function. Each Reducer then
fetches a range of keys, assembled from all SequenceFiles produced by the Mappers, over the
internal network using HTTP. These individual sorted ranges are then merged into a single sorted
range. These events are usually collectively referred to as the “shuffle phase”.
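The hash partitioning step can be sketched in plain Java. The class name is our own, but the formula mirrors what Hadoop's default HashPartitioner does: mask off the sign bit, then take the result modulo the number of Reducers, so every Mapper sends a given key to the same Reducer:

```java
// Sketch of hash partitioning: every Mapper computes the same partition
// for the same key, so all values for one key end up at a single Reducer.
public class PartitionSketch {
    public static int partition(String key, int numReducers) {
        // Mask off the sign bit so the modulo result is never negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }
}
```

Because the function is deterministic, partition("en", 4) yields the same partition number on every node; with 4 Reducers the result always falls in the range 0–3.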
org.apache.hadoop.mapreduce.Reducer
@See: http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapreduce/Reducer.html
A Reducer, like a Mapper, implements application-specific logic. You can draw an analogy with SQL
to understand the distinction. In an SQL SELECT query, the input data (a table) is filtered by zero or
more conditions in a WHERE clause. The resulting data is optionally grouped, maybe because of a
GROUP BY clause, and after that the aggregate functions can be applied (SUM(), AVG(), COUNT(),
etcetera). The conditional logic of a query, in MapReduce terms, is done by the Mapper. When the
Mappers are finished, the resulting data is sorted on the keys. The Reducers take care of the aggregate
functions (and can be arbitrarily complex). (This analogy is actually part of a discussion that has been
going on for some time now¹.)
A Reducer, after having completed the shuffle phase, has a number of keys, each with one or more
values, to apply its logic to. Like Mapper, Reducer has setup and cleanup functions that can be
overridden. The application logic is applied through the function:
public void reduce(KEYIN key, Iterable<VALUEIN> values, Reducer.Context context)
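Continuing the SQL analogy, here is a plain-Java sketch of what a SUM()-style reduce does for a single key and its values. The class is illustrative, not part of Hadoop; a real Reducer would receive Hadoop's Writable types and emit the result through Context.write():

```java
// Sketch: what a summing reduce() does for one key's values,
// analogous to SELECT key, SUM(value) ... GROUP BY key.
public class ReduceSketch {
    public static long sum(Iterable<Long> values) {
        long total = 0;
        for (long v : values) {
            total += v;
        }
        return total;
    }
}
```

Given the values 1, 2 and 3 for one key, the reduce step would emit a single pair mapping that key to 6.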
org.apache.hadoop.mapreduce.OutputFormat
@See: http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapreduce/OutputFormat.html
@See: http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapreduce/RecordWriter.html
@See: http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/fs/FileSystem.html
The OutputFormat is responsible for:
• Validating the job's output specification
• Providing an implementation of RecordWriter to be used to write the output files of the job. The
output is written to a FileSystem.
An empty Hadoop MapReduce job in Java
Any MapReduce job in Java consists of a minimum of three classes:
• a Tool
• a Mapper
• and a Reducer
An implementation of an empty MapReduce job that can be used as a base for new jobs can be found in
a SARA Subversion repository: https://subtrac.sara.nl/oss/svn/hadoop/trunk/BaseProject/. Read-only
access is provided for anonymous users. The example code in this document is simplified code from
this repository.
org.apache.hadoop.util.Tool
@See: http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/util/Tool.html
@See: http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/conf/Configurable.html
@See: http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/conf/Configured.html
@See: http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/util/ToolRunner.html
@See: http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/conf/Configuration.html
@See: http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapreduce/Job.html
@See: http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapreduce/lib/input/FileInputFormat.html
@See: http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapreduce/lib/output/FileOutputFormat.html
1 http://www.databasecolumn.com/2008/01/mapreduce-a-major-step-back.html
An implementation of Tool is the single point of entry for any Hadoop MapReduce application. The
implementing class should expose a main() method. It is commonly used to configure the job – either
through the parsing of command-line options, static configuration in the code itself, or a combination of
both. The Tool interface has Configurable as its superinterface. Therefore, an
implementation of Tool must either subclass an implementation of Configurable or implement the
interface itself. The typical Hadoop MapReduce application subclasses Configured, which is an
implementation of Configurable.
Next to the main method, an implementation of Tool should override:
public int run(String[] args) throws Exception
The run method is responsible for actually configuring and running the Job. Below is a simplified
implementation of Tool, followed by a step-by-step explanation.
    import org.apache.commons.logging.Log;
    import org.apache.commons.logging.LogFactory;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    public class RunnerTool extends Configured implements Tool {

        /**
         * An org.apache.commons.logging.Log object
         */
        private static final Log LOG = LogFactory.getLog(RunnerTool.class.getName());

        /**
         * This function handles configuration and submission of your
         * MapReduce job.
         * @return 1 on failure, 0 on success
         * @throws Exception
         */
        @Override
        public int run(String[] arg0) throws Exception {
            Configuration conf = getConf();
            Job job = new Job(conf);
            job.setJarByClass(RunnerTool.class);
            job.setMapperClass(MyMapper.class);
            job.setReducerClass(MyReducer.class);
            FileInputFormat.addInputPath(job, new Path("indir"));
            FileOutputFormat.setOutputPath(job, new Path("outdir" + System.nanoTime()));
            if (!job.waitForCompletion(true)) {
                LOG.error("Job failed!");
                return 1;
            }
            return 0;
        }

        /**
         * Main method. Runs the application.
         * @param args
         * @throws Exception
         */
        public static void main(String... args) throws Exception {
            System.exit(ToolRunner.run(new RunnerTool(), args));
        }
    }
1. The main method uses the static ToolRunner.run() method. This method parses generic
Hadoop command line options² and, if necessary, modifies the Configuration object. After
that it calls RunnerTool.run().
2. Our RunnerTool.run() method starts by fetching the job's Configuration object. The
object can then be used to further configure the job, using the set*() functions.
3. Then a Job is created using the Configuration object, and we let the Job know which
jar it came from by calling its setJarByClass() method.
4. We need to tell our Job which Mapper and Reducer it should use by calling the
setMapperClass() and setReducerClass() methods.
5. Now we tell the Job on what data it will operate (FileInputFormat.addInputPath(), called
once for each file) and where it should store its output data
(FileOutputFormat.setOutputPath()) (note: the output directory should not yet exist!).
6. The Job now has all the information it needs, and is submitted by calling
job.waitForCompletion().
org.apache.hadoop.mapreduce.Mapper
@See: http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/io/LongWritable.html
@See: http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/io/Text.html
Our empty Mapper class only provides the setup() and map() functions. It is worth noting that,
using Java generics, we tell the Mapper that the type of:
1. the input key will be LongWritable (an object wrapper for the long datatype)
2. the input value will be Text (an object wrapper for text)
3. the output key will be LongWritable as well
4. the output value will be Text.
    import org.apache.commons.logging.Log;
    import org.apache.commons.logging.LogFactory;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class MyMapper extends Mapper<LongWritable, Text, LongWritable, Text> {

        /**
         * An org.apache.commons.logging.Log object
         */
        private static final Log LOG = LogFactory.getLog(MyMapper.class.getName());

        /**
         * This function is called once during the start of the map phase.
         * @param context The job Context
         */
        @Override
        public void setup(Context context) {
        }

        /**
         * This function holds the mapper logic.
         * @param key The key of the K/V input pair
         * @param value The value of the K/V input pair
         * @param context The context of the application
         */
        @Override
        public void map(LongWritable key, Text value, Context context) {
        }
    }

2 http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/util/GenericOptionsParser.html#GenericOptions
org.apache.hadoop.mapreduce.Reducer
@See: http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/io/LongWritable.html
@See: http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/io/Text.html
Our empty Reducer, like the Mapper, only provides the setup() and reduce() functions. Also like
our Mapper, using Java generics, we tell the Reducer that the type of:
1. the input key will be LongWritable
2. the input value will be Text
3. the output key will be LongWritable as well
4. the output value will be Text.
    import org.apache.commons.logging.Log;
    import org.apache.commons.logging.LogFactory;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class MyReducer extends Reducer<LongWritable, Text, LongWritable, Text> {

        /**
         * The LOG Object
         */
        private static final Log LOG = LogFactory.getLog(MyReducer.class.getName());

        /**
         * This function is called once during the start of the reduce phase.
         * @param context The job Context
         */
        @Override
        public void setup(Context context) {
        }

        /**
         * This function holds the reducer logic.
         * @param key The key of the input K/V pair
         * @param values Values associated with key
         * @param context The context of the application
         */
        @Override
        public void reduce(LongWritable key, Iterable<Text> values, Context context) {
        }
    }
A simple try-out: top Wikipedia page views
Courtesy of Edgar Meij³, UvA ILPS, we have access to a sample dataset containing the number of page
views per article, per language code, during a single hour. The data is structured as follows:
[language_code] [article_name] [page_views] [transferred_bytes]
3 edgar.meij@uva.nl
In the example data below, the English language article about Amsterdam has been viewed 215 times
during a certain hour, and these views generated a total of 23312999 bytes (~23 MB) of traffic.
en Amsterdam 215 23312999
You can download the sample dataset from
https://subtrac.sara.nl/oss/svn/hadoop/branches/WikipediaPageCounts/in-dir/.
Data like this could give us an interesting view on the usage of Wikipedia. Say we have this data
collected over a period of months or even longer. We would be able to see the 'rise and fall' in terms of
popularity of a certain page over time, and maybe try to find a relation between the evolution of the
article and its relative size by looking at the total amount of transferred bytes.
But you can start simpler: by extracting the top [N] viewed pages per language code. You can use the
empty MapReduce classes from the previous chapter as a starting point.
The setup
Our Mapper will output the language code as key, and the page views and article title as value – for
each line in our input file.
Our Reducer – which gets the data after the shuffle phase is done and all keys are sorted – will get
all pages associated with a single language code. The Reducer will maintain a top [N] list of the pages
it has seen, and output this list when it has checked all values.
The implementation of Tool we will use has the responsibility to read a single argument: [N]. It
furthermore needs to tell the Job how to handle our Mapper and Reducer, particularly about the expected
InputFormat and OutputFormat, and the outputKeyClass and outputValueClass.
Our Tool
@JavaDoc:
https://subtrac.sara.nl/oss/svn/hadoop/branches/WikipediaPageCounts/doc/javadoc/nl/sara/hadoop/RunnerTool.html
@See: http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapreduce/lib/input/TextInputFormat.html
@See: http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapreduce/lib/output/TextOutputFormat.html
Here are the steps you need to take to get a functional implementation of Tool for this job. Hint: use the
previous chapters if you are missing information, and try to get familiar with the APIs by looking at the
documentation.
1. Our Tool will accept a single argument, N. It will have to pass the argument on from the
main() method to the run() method – keeping missing input in mind, of course. After that it
should use the Configuration.set() method to pass the configuration on to the job.
2. Since we are dealing with plain text, organized in single lines, we can use Hadoop's native
TextInputFormat type to deal with our input file and create FileSplits for our Mapper.
3. The output will be lines in the form of [language_code] [article_name] [page_views]. We can
easily store this as plain text, so we can use Hadoop's TextOutputFormat.
4. Since both the key and the value we will store in our TextOutputFormat will be of type
Text, we should tell our job to expect these types.
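Step 1's argument handling can be sketched in plain Java. The helper below is hypothetical – its name and the "topN" property are our own choice – and the Configuration.set() call is shown as a comment because it needs the Hadoop classes:

```java
// Sketch of step 1: validate the single [N] argument before configuring the job.
// The helper name and the "topN" property key are illustrative choices.
public class ArgSketch {
    public static int parseN(String[] args) {
        if (args.length != 1) {
            throw new IllegalArgumentException("usage: <N>");
        }
        int n = Integer.parseInt(args[0]); // rejects non-numeric input
        if (n <= 0) {
            throw new IllegalArgumentException("N must be positive");
        }
        // In the real Tool, the value would then be passed to the job, e.g.:
        // getConf().set("topN", String.valueOf(n));
        return n;
    }
}
```

The Reducer could later read the value back in its setup() method from the job Context's configuration.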
Our Mapper
@Javadoc:
https://subtrac.sara.nl/oss/svn/hadoop/branches/WikipediaPageCounts/doc/javadoc/nl/sara/hadoop/MyMapper.html
Our Mapper is trivially simple. It needs to split the input value (the TextInputFormat gives the line
itself as value, and the position of the first character of the line in the file as key) on spaces. If that was
successful, it should output the first word – the language code – as key, and the remainder as value.
Even though this is a trivial action and can be written as a single line of code, make sure to deal with
Exceptions. You cannot expect every line in the text to be structured in exactly the same way, and a fact
of life is that most datasets you will work with do not strictly adhere to their intended structure. Fault
tolerance can be achieved by using try / catch blocks in your code, while (especially during
development) logging all entries that raise an Exception.
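One possible shape for that map logic, sketched in plain Java with Strings standing in for Hadoop's Text type. The helper name and the exact value layout are our own choice, not prescribed by the assignment:

```java
// Sketch: split one input line on spaces; emit the language code as key and
// "page_views article_name" as value. Returns null for malformed lines
// (a real map() would log and skip these instead of crashing the job).
public class LineParseSketch {
    public static String[] parse(String line) {
        try {
            String[] fields = line.split(" ");
            String languageCode = fields[0];
            String articleName = fields[1];
            long pageViews = Long.parseLong(fields[2]); // validates the view count
            return new String[] { languageCode, pageViews + " " + articleName };
        } catch (ArrayIndexOutOfBoundsException | NumberFormatException e) {
            return null; // malformed line: log it during development, then skip
        }
    }
}
```

For the sample line "en Amsterdam 215 23312999" this emits key "en" and value "215 Amsterdam", while a garbled line simply yields null instead of an unhandled Exception.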
Our Reducer
@Javadoc:
https://subtrac.sara.nl/oss/svn/hadoop/branches/WikipediaPageCounts/doc/javadoc/nl/sara/hadoop/MyReducer.html
The Reducer is a bit less trivial. We want to loop over all values we receive for a certain key – a
language code in our case – and maintain a top [N] of most viewed pages. Every time we process a
new value, we should check whether it is higher than the lowest value in our top [N], and replace the
lowest value with the current one if it is.
A TreeMap object comes in handy for storing the top [N], since it keeps its entries sorted by key. This
makes finding the current lowest value of your top very easy.
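A minimal sketch of that top-[N] bookkeeping in plain Java (the class is illustrative; a real reduce() would first parse the page-view count out of each Text value):

```java
import java.util.TreeMap;

// Sketch: keep only the N highest page-view counts seen so far.
// TreeMap keeps its entries sorted by key, so firstKey() is always
// the current lowest count in the top N.
public class TopNSketch {
    private final int n;
    private final TreeMap<Long, String> top = new TreeMap<>();

    public TopNSketch(int n) { this.n = n; }

    public void offer(long pageViews, String article) {
        top.put(pageViews, article);
        if (top.size() > n) {
            top.remove(top.firstKey()); // drop the current lowest entry
        }
    }

    public TreeMap<Long, String> result() { return top; }
}
```

Note that a plain TreeMap keeps only one article per distinct view count, so ties overwrite each other; a real solution might use a TreeMap<Long, List<String>> (or append tied articles to the stored value) instead.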