SARA Hadoop Hackathon
December 2010


Table of Contents
An introduction to Java MapReduce jobs in Apache Hadoop
   org.apache.hadoop.mapreduce.InputFormat
   org.apache.hadoop.mapreduce.Mapper
   org.apache.hadoop.io.SequenceFile and org.apache.hadoop.mapreduce.Partitioner
   org.apache.hadoop.mapreduce.Reducer
   org.apache.hadoop.mapreduce.OutputFormat
An empty Hadoop MapReduce job in Java
   org.apache.hadoop.util.Tool
   org.apache.hadoop.mapreduce.Mapper
   org.apache.hadoop.mapreduce.Reducer
A simple try-out: top Wikipedia page views
   The setup
   Our Tool
   Our Mapper
   Our Reducer


An introduction to Java MapReduce jobs in Apache Hadoop
A MapReduce job written in Java typically consists of the following components:
   1. An InputFormat
   2. A Mapper
   3. A SequenceFile and Partitioner
   4. A Reducer
   5. An OutputFormat

org.apache.hadoop.mapreduce.InputFormat
@See: http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapreduce/InputFormat.html
@See: http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapreduce/InputSplit.html
@See: http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapreduce/RecordReader.html

It is the InputFormat's responsibility to:
      • Validate the input of the Job
      • Split up the input into logical InputSplits, which will be assigned to each Mapper
      • Provide an implementation of a RecordReader, which is used by a Mapper to read input
          records from the logical InputSplit.
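
As a concrete illustration (a minimal sketch, not code from the repository discussed later), the InputFormat is selected on the Job object inside the Tool's run() method that is introduced in the next chapter; here Hadoop's own TextInputFormat is used, and "in-dir" is just an example path:

// A minimal sketch, to be placed inside a Tool's run() method; imports are
// omitted here, as in the other listings in this document.
// TextInputFormat splits plain-text files into line-oriented records:
// the byte offset of the line is the key, the line itself is the value.
Job job = new Job(getConf());
job.setInputFormatClass(TextInputFormat.class);
FileInputFormat.addInputPath(job, new Path("in-dir")); // the input the job will validate and split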

org.apache.hadoop.mapreduce.Mapper
@See: http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapreduce/Mapper.html
@See: http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapreduce/RecordReader.html
@See: http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapreduce/InputSplit.html
@See: http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapreduce/Mapper.Context.html

A Mapper implements application-specific logic. It reads a set of key / value pairs as input from a
RecordReader, and generates a set of key / value pairs as output.

A Mapper should override a function map:
public void map(KEYIN key, VALUEIN value, Mapper.Context context) 


Every time a Mapper gets initialized, which happens once for each InputSplit, a function is
called to set up the object. You can optionally override this function and do your own setup:
public void setup(Mapper.Context context) 


Similarly, you can override a cleanup function that is called when the Mapper object is destroyed:
public void cleanup(Mapper.Context context) 


Output from a Mapper is collected from within Mapper.map(). The Context object, provided as a
parameter to the function, exposes a function that must be used for this task:
public void write(KEYOUT key, VALUEOUT value)
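
For example, a hypothetical word-splitting Mapper (a minimal sketch, unrelated to the example job later in this document; imports are omitted, as in the other listings) could collect its output like this:

public class WordMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

    private final Text word = new Text();
    private final LongWritable one = new LongWritable(1);

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Split the input line on whitespace and emit a (word, 1) pair per token.
        for (String token : value.toString().split("\\s+")) {
            if (token.length() > 0) {
                word.set(token);
                context.write(word, one); // output is collected through the Context
            }
        }
    }
}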



org.apache.hadoop.io.SequenceFile and
    org.apache.hadoop.mapreduce.Partitioner
@See: http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/io/SequenceFile.html
@See: http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapreduce/Partitioner.html
@See: http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/io/SequenceFile.Sorter.html


Temporary outputs from the Mappers are stored in SequenceFiles. A SequenceFile is a binary
representation of key / value pairs. The SequenceFile class provides a:
   • SequenceFile.Reader
   • SequenceFile.Writer
   • and a SequenceFile.Sorter.

If the job is configured to use more than one Reducer, then the sorted SequenceFile is partitioned
by a Partitioner, creating as many partitions as there are Reducers. The partitioning is done by executing
some function on each key in the SequenceFile, typically a hash function. Each Reducer then
fetches a range of keys, assembled from all SequenceFiles produced by the Mappers, over the
internal network using HTTP. These individual sorted ranges are then merged into a single sorted
range. These events are usually collectively referred to as the “shuffle phase”.
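
If the default hash-based partitioning is not appropriate, a custom Partitioner can be supplied. A minimal sketch (illustrative only, functionally close to Hadoop's default HashPartitioner) for Text keys and values:

public class MyPartitioner extends Partitioner<Text, Text> {

    @Override
    public int getPartition(Text key, Text value, int numPartitions) {
        // Mask out the sign bit so the modulo never yields a negative number,
        // then map the key's hash onto one of the available partitions (Reducers).
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}

The job would be told to use it through job.setPartitionerClass(MyPartitioner.class).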

org.apache.hadoop.mapreduce.Reducer
@See: http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapreduce/Reducer.html


A Reducer, like a Mapper, implements application-specific logic. You can draw an analogy with SQL
to understand the distinction. In an SQL SELECT query, the input data (a table) is filtered by zero or
more conditions in a WHERE clause. The resulting data is optionally grouped, typically by a
GROUP BY clause, and after that aggregate functions can be applied (SUM(), AVG(), COUNT(),
etcetera). In MapReduce terms, the conditional logic of a query is done by the Mapper. When the
Mappers are finished, the resulting data is sorted on the keys. The Reducers take care of the aggregate
functions (and can be arbitrarily complex). (This analogy is actually part of a discussion that has been
going on for some time now; see http://www.databasecolumn.com/2008/01/mapreduce-a-major-step-back.html.)


A Reducer, after having completed the shuffle phase, has a number of keys, each with one or more
values, to apply its logic to. Like Mapper, Reducer has setup and cleanup functions that can be
overridden. The application logic is applied through the function:
public void reduce(KEYIN key, Iterable<VALUEIN> values, Reducer.Context context)
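
For example, a hypothetical summing Reducer (a minimal sketch with assumed types, not the Reducer used later in this document; imports omitted) could look like this:

public class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {

    @Override
    public void reduce(Text key, Iterable<LongWritable> values, Context context)
            throws IOException, InterruptedException {
        // Sum all values that arrived for this key and emit a single output pair.
        long sum = 0;
        for (LongWritable value : values) {
            sum += value.get();
        }
        context.write(key, new LongWritable(sum));
    }
}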


org.apache.hadoop.mapreduce.OutputFormat
@See: http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapreduce/OutputFormat.html
@See: http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapreduce/RecordWriter.html
@See: http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/fs/FileSystem.html


The OutputFormat is responsible for:
    •   validating the job's output specification
    •   providing an implementation of RecordWriter, which is used to write the output files of the
        job. The output is written to a FileSystem.
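
As with the InputFormat, the OutputFormat is selected on the Job object. A minimal sketch (again inside a Tool's run() method, using Hadoop's TextOutputFormat; the Text key and value classes and the "out-dir" path are just examples):

// TextOutputFormat writes one "key<TAB>value" line per output pair.
job.setOutputFormatClass(TextOutputFormat.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
FileOutputFormat.setOutputPath(job, new Path("out-dir")); // must not exist yet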


An empty Hadoop MapReduce job in Java
Any MapReduce job in Java implements a minimum of three classes:
    •   a Tool
    •   a Mapper
    •   and a Reducer

An implementation of an empty MapReduce job that can be used as a base for new jobs can be found
in a SARA Subversion repository: https://subtrac.sara.nl/oss/svn/hadoop/trunk/BaseProject/. Read-only
access is provided for anonymous users. The example code in this document is simplified code from
this repository.



org.apache.hadoop.util.Tool
@See: http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/util/Tool.html
@See: http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/conf/Configurable.html
@See: http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/conf/Configured.html
@See: http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/util/ToolRunner.html
@See: http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/conf/Configuration.html
@See: http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapreduce/Job.html
@See: http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapreduce/lib/input/FileInputFormat.html
@See: http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapreduce/lib/output/FileOutputFormat.html



An implementation of Tool is the single point of entry for any Hadoop MapReduce application. The
implementing class should expose a main() method. It is commonly used to configure the job, either
through the parsing of command-line options, static configuration in the code itself, or a combination of
both. The Tool interface has Configurable as its superinterface. Therefore, an
implementation of Tool must either subclass an implementation of Configurable or implement the
interface itself. The typical Hadoop MapReduce application subclasses Configured, which is an
implementation of Configurable.


In addition to the main method, an implementation of Tool should override:
public int run(String[] args) throws Exception


The run method is responsible for actually configuring and running the Job. Below is a simplified
implementation of Tool, followed by a step-by-step explanation.


public class RunnerTool extends Configured implements Tool {

    /**
     * An org.apache.commons.logging.Log object
     */
    private static final Log LOG = LogFactory.getLog(RunnerTool.class.getName());

    /**
     * This function handles configuration and submission of your
     * MapReduce job.
     * @return 1 on failure, 0 on success
     * @throws Exception
     */
    @Override
    public int run(String[] arg0) throws Exception {
        Configuration conf = getConf();
        Job job = new Job(conf);
        job.setJarByClass(RunnerTool.class);
        job.setMapperClass(MyMapper.class);
        job.setReducerClass(MyReducer.class);
        
        FileInputFormat.addInputPath(job, new Path("in-dir"));
        FileOutputFormat.setOutputPath(job, new Path("out-dir-" + System.nanoTime()));
        
        if (!job.waitForCompletion(true)) {
            LOG.error("Job failed!");
            return 1;
        }
        return 0;
    }
    
    /**
     * Main method. Runs the application.
     * @param args
     * @throws Exception
     */
    public static void main(String... args) throws Exception {
        System.exit(ToolRunner.run(new RunnerTool(), args));
    }

}



    1. The main method uses the static ToolRunner.run() method. This method parses generic
       Hadoop command line options (see
       http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/util/GenericOptionsParser.html#GenericOptions)
       and, if necessary, modifies the Configuration object. After that it calls RunnerTool.run().
    2. Our RunnerTool.run() method starts by fetching the job's Configuration object. The
       object can then be used to further configure the job, using the set*() functions.
    3. Then a Job is created using the Configuration object, and we let the Job know which
       jar it came from by calling its setJarByClass() method.
    4. We tell our Job which Mapper and Reducer it should use by calling the
       setMapperClass() and setReducerClass() methods.
    5. Now we tell the Job what data it will operate on (FileInputFormat.addInputPath(), called
       once for each input path) and where it should store its output data
       (FileOutputFormat.setOutputPath()) (note: the output directory must not yet exist!).
    6. The Job now has all the information it needs, and is submitted by calling
       job.waitForCompletion().
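
As an illustration of step 1 (the property name top.n is purely hypothetical): a generic option such as -D top.n=10 on the command line is parsed by ToolRunner before run() is called, so the value ends up in the Configuration and can simply be read back:

// Inside run(); "top.n" is a made-up property name used for illustration.
Configuration conf = getConf();
int n = conf.getInt("top.n", 10); // 10 is an arbitrary fallback default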



org.apache.hadoop.mapreduce.Mapper
@See: http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/io/LongWritable.html
@See: http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/io/Text.html


Our empty Mapper class only provides the setup() and map() functions. It is worth noting that,
using Java generics, we tell the Mapper that the type of:
    1. the input key will be LongWritable (an object wrapper for the long datatype)
    2. the input value will be Text (an object wrapper for text)
    3. the output key will be LongWritable as well
    4. the output value will be Text.


public class MyMapper extends Mapper<LongWritable, Text, LongWritable, Text> {

    /**
     * An org.apache.commons.logging.Log object
     */
    private static final Log LOG = LogFactory.getLog(RunnerTool.class.getName());

    /**
     * This function is called once during the start of the map phase.
     * @param context The job Context
     */
    @Override
    public void setup(Context context) {
        
    }

    /**
     * This function holds the mapper logic
     * @param key The key of the K/V input pair
     * @param value The value of the K/V input pair
     * @param context The context of the application
     */
    @Override
    public void map(LongWritable key, Text value, Context context) {
        
    }

}




org.apache.hadoop.mapreduce.Reducer
@See: http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/io/LongWritable.html
@See: http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/io/Text.html


Our empty Reducer, like the Mapper, only provides the setup() and reduce() functions. Also like
our Mapper, using Java generics, we tell the Reducer that the type of:
    1. the input key will be LongWritable
    2. the input value will be Text
    3. the output key will be LongWritable as well
    4. the output value will be Text.


public class MyReducer extends Reducer<LongWritable, Text, LongWritable, Text> {

    /**
     * The LOG Object
     */
    private static final Log LOG = LogFactory.getLog(RunnerTool.class.getName());

    /**
     * This function is called once during the start of the reduce phase.
     * @param context The job Context
     */
    @Override
    public void setup(Context context) {
        
    }
    
    /**
     * This function holds the reducer logic.
     * @param key The key of the input K/V pair
     * @param values Values associated with key
     * @param context The context of the application
     */
    @Override
    public void reduce(LongWritable key, Iterable<Text> values, Context context) {
        
    }
}


A simple try-out: top Wikipedia page views
Courtesy of Edgar Meij (edgar.meij@uva.nl), UvA ILPS, we have access to a sample dataset containing the number
of page views per article, per language code, during a single hour. The data is structured as follows:
[language_code] [article_name] [page_views] [transfered_bytes]




In the example data below, the English-language article about Amsterdam has been viewed 215 times
during a certain hour, and these views generated a total of 23312999 bytes (~23MB) of traffic.
en Amsterdam 215 23312999


You can download the sample dataset from
https://subtrac.sara.nl/oss/svn/hadoop/branches/WikipediaPageCounts/in-dir/.
Data like this could give us an interesting view of how Wikipedia is used. Say we have this data
collected over a period of months or even longer. We would be able to see the 'rise and fall' in
popularity of a certain page over time, and maybe try to find a relation between the evolution of an
article and its relative size by looking at the total amount of transferred bytes.
But you can start simpler: by extracting the top [N] viewed pages per language code. You can use the
empty MapReduce classes from the previous chapter as a starting point.

The setup
Our Mapper will output the language code as key, and the page views and article title as value – for
each line in our input file.
Our Reducer, which gets the data after the shuffle phase is done and the keys are sorted, will get
all pages associated with a single language code. The Reducer will maintain a top [N] list of the pages
it has seen, and output this list once it has checked all values.
The implementation of Tool we will use is responsible for reading a single argument: [N]. It
furthermore needs to tell the Job how to handle our Mapper and Reducer, particularly which
InputFormat and OutputFormat to expect, and which outputKeyClass and outputValueClass to use.


Our Tool
@JavaDoc:
https://subtrac.sara.nl/oss/svn/hadoop/branches/WikipediaPageCounts/doc/javadoc/nl/sara/hadoop/RunnerTool.html
@See: http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapreduce/lib/input/TextInputFormat.html
@See: http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapreduce/lib/output/TextOutputFormat.html

Listed here are the steps you need to take to get a functional implementation of Tool for this job. (A
rough sketch of a possible run() configuration follows after the list.) Hint: use the previous chapters if
you are missing information, and try to get familiar with the APIs by looking at the documentation.

    1. Our Tool will accept a single argument, N. It will have to pass the argument on from the
       main() method to the run() method, keeping missing input in mind, of course. After that it
       should use the Configuration.set() method to pass the value on to the job.
    2. Since we are dealing with plain text, organized in single lines, we can use Hadoop's native
       TextInputFormat type to deal with our input file and create FileSplits for our Mapper.
    3. The output will be lines in the form of [language_code] [article_name] [page_views]. We can
       easily store this as plain text, so we can use Hadoop's TextOutputFormat.
    4. Since both the key and the value we will store in our TextOutputFormat will be of type
       Text, we should tell our job to expect these types.
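
One possible way to wire this together is sketched below. This is a rough sketch, not the reference solution from the repository; the property name wikipedia.top.n, the default value and the argument handling are assumptions.

    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = getConf();
        // Step 1: pass N on to the job through the Configuration.
        // "wikipedia.top.n" is a made-up property name.
        String n = (args.length > 0) ? args[0] : "10";
        conf.set("wikipedia.top.n", n);

        Job job = new Job(conf);
        job.setJarByClass(RunnerTool.class);
        job.setMapperClass(MyMapper.class);
        job.setReducerClass(MyReducer.class);

        // Steps 2 and 3: plain text in, plain text out.
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        // Step 4: both the output key and the output value are Text.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path("in-dir"));
        FileOutputFormat.setOutputPath(job, new Path("out-dir-" + System.nanoTime()));

        return job.waitForCompletion(true) ? 0 : 1;
    }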
Our Mapper
@Javadoc:
https://subtrac.sara.nl/oss/svn/hadoop/branches/WikipediaPageCounts/doc/javadoc/nl/sara/hadoop/MyMapper.html


Our Mapper is trivially simple. It needs to split the input value (the TextInputFormat gives the line
itself as value, and the position of the first character of the line in the file as key) on spaces. If that was
successful, it should output the first word, the language code, as key, and the remainder as value.
Even though this is a trivial action and can be written as a single line of code, make sure to deal with
Exceptions. You cannot expect every line in the text to be structured in exactly the same way, and a fact
of life is that most datasets you will work with do not strictly adhere to their supposed structure. Fault
tolerance can be achieved by wrapping the parsing in try / catch blocks and, especially during
development, logging all entries that raise an Exception.
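
A possible map() for this job is sketched below (one way to do it, not the reference code from the repository; imports and logging are omitted):

public class MyMapper extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        try {
            // Input line: [language_code] [article_name] [page_views] [transfered_bytes]
            String[] fields = value.toString().split(" ");
            // Key: the language code. Value: the page views plus the article name.
            context.write(new Text(fields[0]), new Text(fields[2] + " " + fields[1]));
        } catch (Exception e) {
            // Malformed lines are silently skipped here; during development you
            // would log them instead.
        }
    }
}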

Our Reducer
@Javadoc:
https://subtrac.sara.nl/oss/svn/hadoop/branches/WikipediaPageCounts/doc/javadoc/nl/sara/hadoop/MyReducer.html


The Reducer is a bit less trivial. We want to loop over all values we receive for a certain key, a
language code in our case, and maintain a top [N] of most viewed pages. Every time we process a
new value, we should check whether its view count is higher than the lowest count in our top [N], and
replace that lowest entry with the current one if it is.
A TreeMap object comes in handy for storing the top [N], since it keeps its entries sorted by key. This
makes finding the current lowest entry of your top very easy.
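
A possible reduce() for this job is sketched below (again one way to do it, not the reference code; it assumes the value format and the wikipedia.top.n property from the earlier sketches, and imports are omitted):

public class MyReducer extends Reducer<Text, Text, Text, Text> {

    private int n = 10; // the [N] of our top, overridden from the configuration in setup()

    @Override
    public void setup(Context context) {
        // "wikipedia.top.n" is the made-up property name used in the Tool sketch.
        n = context.getConfiguration().getInt("wikipedia.top.n", 10);
    }

    @Override
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // TreeMap keeps its entries sorted by key (here: the page view count),
        // so firstKey() is always the lowest count currently in the top [N].
        // Note that pages with identical view counts overwrite each other in
        // this simple sketch.
        TreeMap<Long, String> top = new TreeMap<Long, String>();
        for (Text value : values) {
            String[] fields = value.toString().split(" ");
            long views = Long.parseLong(fields[0]);
            if (top.size() < n) {
                top.put(views, fields[1]);
            } else if (views > top.firstKey()) {
                top.remove(top.firstKey()); // drop the current lowest entry
                top.put(views, fields[1]);
            }
        }
        // Emit the top [N] for this language code: [article_name] [page_views].
        for (Map.Entry<Long, String> entry : top.entrySet()) {
            context.write(key, new Text(entry.getValue() + " " + entry.getKey()));
        }
    }
}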

Weitere ähnliche Inhalte

Was ist angesagt?

Apache Spark - Running on a Cluster | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Running on a Cluster | Big Data Hadoop Spark Tutorial | CloudxLabApache Spark - Running on a Cluster | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Running on a Cluster | Big Data Hadoop Spark Tutorial | CloudxLabCloudxLab
 
Spark shuffle introduction
Spark shuffle introductionSpark shuffle introduction
Spark shuffle introductioncolorant
 
Apache Spark: What's under the hood
Apache Spark: What's under the hoodApache Spark: What's under the hood
Apache Spark: What's under the hoodAdarsh Pannu
 
Hive dirty/beautiful hacks in TD
Hive dirty/beautiful hacks in TDHive dirty/beautiful hacks in TD
Hive dirty/beautiful hacks in TDSATOSHI TAGOMORI
 
Spark & Spark Streaming Internals - Nov 15 (1)
Spark & Spark Streaming Internals - Nov 15 (1)Spark & Spark Streaming Internals - Nov 15 (1)
Spark & Spark Streaming Internals - Nov 15 (1)Akhil Das
 
Apache Spark overview
Apache Spark overviewApache Spark overview
Apache Spark overviewDataArt
 
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Databricks
 
Using pySpark with Google Colab & Spark 3.0 preview
Using pySpark with Google Colab & Spark 3.0 previewUsing pySpark with Google Colab & Spark 3.0 preview
Using pySpark with Google Colab & Spark 3.0 previewMario Cartia
 
Spark For The Business Analyst
Spark For The Business AnalystSpark For The Business Analyst
Spark For The Business AnalystGustaf Cavanaugh
 
Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutori...
Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutori...Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutori...
Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutori...CloudxLab
 
Advanced spark training advanced spark internals and tuning reynold xin
Advanced spark training advanced spark internals and tuning reynold xinAdvanced spark training advanced spark internals and tuning reynold xin
Advanced spark training advanced spark internals and tuning reynold xincaidezhi655
 
Apache Spark Introduction | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark Introduction | Big Data Hadoop Spark Tutorial | CloudxLabApache Spark Introduction | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark Introduction | Big Data Hadoop Spark Tutorial | CloudxLabCloudxLab
 
BDM25 - Spark runtime internal
BDM25 - Spark runtime internalBDM25 - Spark runtime internal
BDM25 - Spark runtime internalDavid Lauzon
 
Getting started in Apache Spark and Flink (with Scala) - Part II
Getting started in Apache Spark and Flink (with Scala) - Part IIGetting started in Apache Spark and Flink (with Scala) - Part II
Getting started in Apache Spark and Flink (with Scala) - Part IIAlexander Panchenko
 
Harnessing the power of Nutch with Scala
Harnessing the power of Nutch with ScalaHarnessing the power of Nutch with Scala
Harnessing the power of Nutch with ScalaKnoldus Inc.
 

Was ist angesagt? (20)

Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
 
Apache Spark - Running on a Cluster | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Running on a Cluster | Big Data Hadoop Spark Tutorial | CloudxLabApache Spark - Running on a Cluster | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Running on a Cluster | Big Data Hadoop Spark Tutorial | CloudxLab
 
Spark shuffle introduction
Spark shuffle introductionSpark shuffle introduction
Spark shuffle introduction
 
Unit 2 part-2
Unit 2 part-2Unit 2 part-2
Unit 2 part-2
 
Map reduce prashant
Map reduce prashantMap reduce prashant
Map reduce prashant
 
Apache Spark: What's under the hood
Apache Spark: What's under the hoodApache Spark: What's under the hood
Apache Spark: What's under the hood
 
Hive dirty/beautiful hacks in TD
Hive dirty/beautiful hacks in TDHive dirty/beautiful hacks in TD
Hive dirty/beautiful hacks in TD
 
Spark & Spark Streaming Internals - Nov 15 (1)
Spark & Spark Streaming Internals - Nov 15 (1)Spark & Spark Streaming Internals - Nov 15 (1)
Spark & Spark Streaming Internals - Nov 15 (1)
 
Apache Spark overview
Apache Spark overviewApache Spark overview
Apache Spark overview
 
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
 
Using pySpark with Google Colab & Spark 3.0 preview
Using pySpark with Google Colab & Spark 3.0 previewUsing pySpark with Google Colab & Spark 3.0 preview
Using pySpark with Google Colab & Spark 3.0 preview
 
Spark For The Business Analyst
Spark For The Business AnalystSpark For The Business Analyst
Spark For The Business Analyst
 
Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutori...
Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutori...Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutori...
Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutori...
 
Advanced spark training advanced spark internals and tuning reynold xin
Advanced spark training advanced spark internals and tuning reynold xinAdvanced spark training advanced spark internals and tuning reynold xin
Advanced spark training advanced spark internals and tuning reynold xin
 
Apache Spark Introduction | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark Introduction | Big Data Hadoop Spark Tutorial | CloudxLabApache Spark Introduction | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark Introduction | Big Data Hadoop Spark Tutorial | CloudxLab
 
BDM25 - Spark runtime internal
BDM25 - Spark runtime internalBDM25 - Spark runtime internal
BDM25 - Spark runtime internal
 
Getting started in Apache Spark and Flink (with Scala) - Part II
Getting started in Apache Spark and Flink (with Scala) - Part IIGetting started in Apache Spark and Flink (with Scala) - Part II
Getting started in Apache Spark and Flink (with Scala) - Part II
 
Apache PIG
Apache PIGApache PIG
Apache PIG
 
Unit 4-apache pig
Unit 4-apache pigUnit 4-apache pig
Unit 4-apache pig
 
Harnessing the power of Nutch with Scala
Harnessing the power of Nutch with ScalaHarnessing the power of Nutch with Scala
Harnessing the power of Nutch with Scala
 

Ähnlich wie Hadoop Hackathon Reader

Mapreduce by examples
Mapreduce by examplesMapreduce by examples
Mapreduce by examplesAndrea Iacono
 
Hadoop interview questions
Hadoop interview questionsHadoop interview questions
Hadoop interview questionsKalyan Hadoop
 
Hadoop 31-frequently-asked-interview-questions
Hadoop 31-frequently-asked-interview-questionsHadoop 31-frequently-asked-interview-questions
Hadoop 31-frequently-asked-interview-questionsAsad Masood Qazi
 
July 2010 Triangle Hadoop Users Group - Chad Vawter Slides
July 2010 Triangle Hadoop Users Group - Chad Vawter SlidesJuly 2010 Triangle Hadoop Users Group - Chad Vawter Slides
July 2010 Triangle Hadoop Users Group - Chad Vawter Slidesryancox
 
High-level Programming Languages: Apache Pig and Pig Latin
High-level Programming Languages: Apache Pig and Pig LatinHigh-level Programming Languages: Apache Pig and Pig Latin
High-level Programming Languages: Apache Pig and Pig LatinPietro Michiardi
 
Cascading on starfish
Cascading on starfishCascading on starfish
Cascading on starfishFei Dong
 
Talend openstudio bigdata_gettingstarted_6.3.0_en
Talend openstudio bigdata_gettingstarted_6.3.0_enTalend openstudio bigdata_gettingstarted_6.3.0_en
Talend openstudio bigdata_gettingstarted_6.3.0_enManoj Sharma
 
Learn to setup a Hadoop Multi Node Cluster
Learn to setup a Hadoop Multi Node ClusterLearn to setup a Hadoop Multi Node Cluster
Learn to setup a Hadoop Multi Node ClusterEdureka!
 
R Data Access from hdfs,spark,hive
R Data Access  from hdfs,spark,hiveR Data Access  from hdfs,spark,hive
R Data Access from hdfs,spark,hivearunkumar sadhasivam
 
Lightening Fast Big Data Analytics using Apache Spark
Lightening Fast Big Data Analytics using Apache SparkLightening Fast Big Data Analytics using Apache Spark
Lightening Fast Big Data Analytics using Apache SparkManish Gupta
 
Hadoop Performance Optimization at Scale, Lessons Learned at Twitter
Hadoop Performance Optimization at Scale, Lessons Learned at TwitterHadoop Performance Optimization at Scale, Lessons Learned at Twitter
Hadoop Performance Optimization at Scale, Lessons Learned at TwitterDataWorks Summit
 
Apache spark installation [autosaved]
Apache spark installation [autosaved]Apache spark installation [autosaved]
Apache spark installation [autosaved]Shweta Patnaik
 
Hadoop and HBase experiences in perf log project
Hadoop and HBase experiences in perf log projectHadoop and HBase experiences in perf log project
Hadoop and HBase experiences in perf log projectMao Geng
 
Hadoop interview question
Hadoop interview questionHadoop interview question
Hadoop interview questionpappupassindia
 
MAP REDUCE IN DATA SCIENCE.pptx
MAP REDUCE IN DATA SCIENCE.pptxMAP REDUCE IN DATA SCIENCE.pptx
MAP REDUCE IN DATA SCIENCE.pptxHARIKRISHNANU13
 
DistributingSoftwareKnowledgeForDevOps
DistributingSoftwareKnowledgeForDevOpsDistributingSoftwareKnowledgeForDevOps
DistributingSoftwareKnowledgeForDevOpsPaul Worrall
 

Ähnlich wie Hadoop Hackathon Reader (20)

Mapreduce by examples
Mapreduce by examplesMapreduce by examples
Mapreduce by examples
 
Hadoop interview questions
Hadoop interview questionsHadoop interview questions
Hadoop interview questions
 
Hadoop 31-frequently-asked-interview-questions
Hadoop 31-frequently-asked-interview-questionsHadoop 31-frequently-asked-interview-questions
Hadoop 31-frequently-asked-interview-questions
 
July 2010 Triangle Hadoop Users Group - Chad Vawter Slides
July 2010 Triangle Hadoop Users Group - Chad Vawter SlidesJuly 2010 Triangle Hadoop Users Group - Chad Vawter Slides
July 2010 Triangle Hadoop Users Group - Chad Vawter Slides
 
Unit 2
Unit 2Unit 2
Unit 2
 
Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
 
High-level Programming Languages: Apache Pig and Pig Latin
High-level Programming Languages: Apache Pig and Pig LatinHigh-level Programming Languages: Apache Pig and Pig Latin
High-level Programming Languages: Apache Pig and Pig Latin
 
Cascading on starfish
Cascading on starfishCascading on starfish
Cascading on starfish
 
Talend openstudio bigdata_gettingstarted_6.3.0_en
Talend openstudio bigdata_gettingstarted_6.3.0_enTalend openstudio bigdata_gettingstarted_6.3.0_en
Talend openstudio bigdata_gettingstarted_6.3.0_en
 
Learn to setup a Hadoop Multi Node Cluster
Learn to setup a Hadoop Multi Node ClusterLearn to setup a Hadoop Multi Node Cluster
Learn to setup a Hadoop Multi Node Cluster
 
R Data Access from hdfs,spark,hive
R Data Access  from hdfs,spark,hiveR Data Access  from hdfs,spark,hive
R Data Access from hdfs,spark,hive
 
Lightening Fast Big Data Analytics using Apache Spark
Lightening Fast Big Data Analytics using Apache SparkLightening Fast Big Data Analytics using Apache Spark
Lightening Fast Big Data Analytics using Apache Spark
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
Hadoop Performance Optimization at Scale, Lessons Learned at Twitter
Hadoop Performance Optimization at Scale, Lessons Learned at TwitterHadoop Performance Optimization at Scale, Lessons Learned at Twitter
Hadoop Performance Optimization at Scale, Lessons Learned at Twitter
 
Apache spark installation [autosaved]
Apache spark installation [autosaved]Apache spark installation [autosaved]
Apache spark installation [autosaved]
 
Hadoop and HBase experiences in perf log project
Hadoop and HBase experiences in perf log projectHadoop and HBase experiences in perf log project
Hadoop and HBase experiences in perf log project
 
Hadoop interview question
Hadoop interview questionHadoop interview question
Hadoop interview question
 
MAP REDUCE IN DATA SCIENCE.pptx
MAP REDUCE IN DATA SCIENCE.pptxMAP REDUCE IN DATA SCIENCE.pptx
MAP REDUCE IN DATA SCIENCE.pptx
 
DistributingSoftwareKnowledgeForDevOps
DistributingSoftwareKnowledgeForDevOpsDistributingSoftwareKnowledgeForDevOps
DistributingSoftwareKnowledgeForDevOps
 
Lecture 2 part 3
Lecture 2 part 3Lecture 2 part 3
Lecture 2 part 3
 

Mehr von Evert Lammerts

Notes on data-intensive processing with Hadoop Mapreduce
Notes on data-intensive processing with Hadoop MapreduceNotes on data-intensive processing with Hadoop Mapreduce
Notes on data-intensive processing with Hadoop MapreduceEvert Lammerts
 
Introduction NL-HUG (April)
Introduction NL-HUG (April)Introduction NL-HUG (April)
Introduction NL-HUG (April)Evert Lammerts
 
Hadoop @ Sara & BiG Grid
Hadoop @ Sara & BiG GridHadoop @ Sara & BiG Grid
Hadoop @ Sara & BiG GridEvert Lammerts
 
First NL-HUG: Large-scale data processing at SARA with Apache Hadoop
First NL-HUG: Large-scale data processing at SARA with Apache HadoopFirst NL-HUG: Large-scale data processing at SARA with Apache Hadoop
First NL-HUG: Large-scale data processing at SARA with Apache HadoopEvert Lammerts
 
Hadoop voor niet-technici
Hadoop voor niet-techniciHadoop voor niet-technici
Hadoop voor niet-techniciEvert Lammerts
 
Scientific computing in The Netherlands
Scientific computing in The NetherlandsScientific computing in The Netherlands
Scientific computing in The NetherlandsEvert Lammerts
 
Large-Scale Data Storage and Processing for Scientists with Hadoop
Large-Scale Data Storage and Processing for Scientists with HadoopLarge-Scale Data Storage and Processing for Scientists with Hadoop
Large-Scale Data Storage and Processing for Scientists with HadoopEvert Lammerts
 
Introduction to SARA's Hadoop Hackathon - dec 7th 2010
Introduction to SARA's Hadoop Hackathon - dec 7th 2010Introduction to SARA's Hadoop Hackathon - dec 7th 2010
Introduction to SARA's Hadoop Hackathon - dec 7th 2010Evert Lammerts
 

Mehr von Evert Lammerts (8)

Notes on data-intensive processing with Hadoop Mapreduce
Notes on data-intensive processing with Hadoop MapreduceNotes on data-intensive processing with Hadoop Mapreduce
Notes on data-intensive processing with Hadoop Mapreduce
 
Introduction NL-HUG (April)
Introduction NL-HUG (April)Introduction NL-HUG (April)
Introduction NL-HUG (April)
 
Hadoop @ Sara & BiG Grid
Hadoop @ Sara & BiG GridHadoop @ Sara & BiG Grid
Hadoop @ Sara & BiG Grid
 
First NL-HUG: Large-scale data processing at SARA with Apache Hadoop
First NL-HUG: Large-scale data processing at SARA with Apache HadoopFirst NL-HUG: Large-scale data processing at SARA with Apache Hadoop
First NL-HUG: Large-scale data processing at SARA with Apache Hadoop
 
Hadoop voor niet-technici
Hadoop voor niet-techniciHadoop voor niet-technici
Hadoop voor niet-technici
 
Scientific computing in The Netherlands
Scientific computing in The NetherlandsScientific computing in The Netherlands
Scientific computing in The Netherlands
 
Large-Scale Data Storage and Processing for Scientists with Hadoop
Large-Scale Data Storage and Processing for Scientists with HadoopLarge-Scale Data Storage and Processing for Scientists with Hadoop
Large-Scale Data Storage and Processing for Scientists with Hadoop
 
Introduction to SARA's Hadoop Hackathon - dec 7th 2010
Introduction to SARA's Hadoop Hackathon - dec 7th 2010Introduction to SARA's Hadoop Hackathon - dec 7th 2010
Introduction to SARA's Hadoop Hackathon - dec 7th 2010
 

Kürzlich hochgeladen

Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfSeasiaInfotech2
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 

Kürzlich hochgeladen (20)

Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdf
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 

Hadoop Hackathon Reader

  • 1. SARA Hadoop Hackathon December 2010 Table of Contents An introduction to Java MapReduce jobs in Apache Hadoop...................................................................1 org.apache.hadoop.mapreduce.InputFormat.........................................................................................1 org.apache.hadoop.mapreduce.Mapper.................................................................................................1 org.apache.hadoop.io.SequenceFile and org.apache.hadoop.mapreduce.Partitioner............................2 org.apache.hadoop.mapreduce.Reducer................................................................................................2 org.apache.hadoop.mapreduce.OutputFormat.......................................................................................3 An empty Hadoop MapReduce job in Java................................................................................................3 org.apache.hadoop.util.Tool..................................................................................................................3 org.apache.hadoop.mapreduce.Mapper.................................................................................................5 org.apache.hadoop.mapreduce.Reducer................................................................................................6 A simple try-out: top Wikipedia page views..............................................................................................6 The setup...............................................................................................................................................7 Our Tool.................................................................................................................................................7 Our Mapper...........................................................................................................................................7 Our Reducer..........................................................................................................................................8 An introduction to Java MapReduce jobs in Apache Hadoop A MapReduce job written in Java typically exists of the following components: 1. An InputFormat 2. A Mapper 3. A SequenceFile and Partioner 4. A Reducer 5. An OutputFormat org.apache.hadoop.mapreduce.InputFormat @See: http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapreduce/InputFormat.html @See: http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapreduce/InputSplit.html @See: http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapreduce/RecordReader.html It is the InputFormat's reponsibility to: • Validate the input of the Job • Split up the input into logical InputSplits, which will be assigned to each Mapper • Provide an implementation of a RecordReader, which is used by a Mapper to read input records from the logical InputSplit. org.apache.hadoop.mapreduce.Mapper @See: http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapreduce/Mapper.html @See: http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapreduce/RecordReader.html @See: http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapreduce/InputSplit.html
  • 2. @See: http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapreduce/Mapper.Context.html A Mapper implements application specific logic. It reades a set of key / value pairs as input from a RecordReader, and generates a set of key / value pairs as output. A Mapper should override a function map: public void map(KEYIN key, VALUEIN value, Mapper.Context context)  Every time a Mapper gets initialized – which happens once for each InputSplit – a function is called to setup the Object. You can optionally override this function and do your own setup: public void setup(Mapper.Context context)  Similarly, you can override a cleanup function that is called when the Mapper object is destoyed: public void cleanup(Mapper.Context context)  Output from a Mapper is collected from within Mapper.map(). The Context object, provided as parameter to the function, exposes a function that must be used for this task: public void write(KEYOUT key, VALUEOUT value) org.apache.hadoop.io.SequenceFile and org.apache.hadoop.mapreduce.Partitioner @See: http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/io/SequenceFile.html @See: http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapreduce/Partitioner.html @See: http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/io/SequenceFile.Sorter.html Temporary outputs from the Mappers are stored in SequenceFile. This is a binary representation of key / value pairs. A SequenceFile object provides a: • java.io.Reader • java.io.Writer • and SequenceFile.Sorter. If the job is configured to use more than one Reducer then the sorted SequenceFile is partitioned by a Partioner, creating as many partitions as Reducers. The partitioning is done by executing some function on each key in the SequenceFile, typically a hash function. Each Reducer then fetches a range of keys, assembled from all SequenceFiles produced by the Mappers, over the internal network using HTTP. These individual sorted ranges are then merged into a single sorted range. These events are usually collectively referred to as the “shuffle phase”. org.apache.hadoop.mapreduce.Reducer @See: http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapreduce/Reducer.html A Reducer, like a Mapper, implements application specific logic. You can draw an analogy with SQL to understand the distinction. In an SQL SELECT query, the input data (a table) is filtered by zero or more conditions in a WHERE clause. The resulting data is optionally grouped, maybe because of a GROUP BY clause, and after that the aggregate functions can be applied (SUM(), AVG(), COUNT(), etcetera). The conditional logic of a query in MapReduce terms, is done by the Mapper. When the
  • 3. Mappers are finished, the resulting data is sorted on the keys. The Reducers take care of the aggregate functions (and can be arbitrarily complex). (This analogy is actually part of a discussion going on for some time now1.) A Reducer, after having completed the shuffle phase, has a number of keys, each with one or more values, to apply its logic to. Like Mapper, Reducer has setup and cleanup functions that can be overridden. The application logic is applied through the function: public void reduce(KEYIN key, Iterable<VALUEIN> values, Reducer.Context context) org.apache.hadoop.mapreduce.OutputFormat @See: http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapreduce/OutputFormat.html @See: http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapreduce/RecordWriter.html @See: http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/fs/FileSystem.html The OutputFormat is responsible for: • validating the job's output specification • Provide an implementation of RecordWriter to be used to write output files of the job. The output is written to a FileSystem. An empty Hadoop MapReduce job in Java Any MapReduce job in Java implements a minimum of three classes: • a Tool • a Mapper • and a Reducer An implementation of an empty MapReduce job that can be used as base for new jobs, can be found in a SARA Subversion repository https://subtrac.sara.nl/oss/svn/hadoop/trunk/BaseProject/. Read-only access is provided for anonymous users. The example code in this document is simplified code from this repository. org.apache.hadoop.util.Tool @See: http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/util/Tool.html @See: http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/conf/Configurable.html @See: http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/conf/Configured.html @See: http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/util/ToolRunner.html @See: http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/conf/Configuration.html @See: http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapreduce/Job.html @See: http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapreduce/lib/input/FileInputFormat.html @See: http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapreduce/lib/output/FileOutputFormat.html 1 http://www.databasecolumn.com/2008/01/mapreduce-a-major-step-back.html
  • 4. An implementation of Tool is the single point of entry for any Hadoop MapReduce application. The implementing class should expose a main() method. It is commonly used to configure the job – either through the parsing of command-line options, static configuration in the code itself or a combination of both. The Tool interface itself, has Configurable as its Superinterface. Therefore, an implementation of tool must either subclass an implementation of Configurable or implement the interface itself. The typical Hadoop MapReduce application subclasses Configured, which is an implementation of Configurable. Next to the main method, an implementation of Tool should override: public int run(String[] args) throws Exception The run method is responsible for actually configuring and running the Job. See here a simplified implementation of Tool. Below the code is a step-by-step explanation. public class RunnerTool extends Configured implements Tool {     /**      * An org.apache.commons.logging.Log object      */     private static final Log LOG = LogFactory.getLog(RunnerTool.class.getName());     /**      * This function handles configuration and submission of your      * MapReduce job.      * @return 1 on failure, 0 on success      * @throws Exception      */     @Override     public int run(String[] arg0) throws Exception {         Configuration conf = getConf();         Job job = new Job(conf);         job.setJarByClass(RunnerTool.class);         job.setMapperClass(MyMapper.class);         job.setReducerClass(MyReducer.class);                  FileInputFormat.addInputPath(job, new Path("in­dir"));         FileOutputFormat.setOutputPath(job, new Path("out­dir­" + System.nanoTime()));                  if (!job.waitForCompletion(true)) {             LOG.error("Job failed!");             return 1;         }         return 0;     }          /**      * Main method. Runs the application.      * @param args      * @throws Exception      */     public static void main(String... args) throws Exception {         System.exit(ToolRunner.run(new RunnerTool(), args));     } } 1. The main method uses the static ToolRunner.run() method. This method parses generic
  • 5. Hadoop command line options2 and, if necessary, modifies the Configuration object. After that it calls RunnerTool.run(). 2. Our RunnerTool.run() method starts by fetching the job's Configuration object. The object can then be used to further configure the job, using the set*() functions. 3. Then a Job is being created using the Configuration object, and we let the Job know what jar it came from by calling its setJarByClass() method. 4. We need to tell our Job which Mapper and Reducer it should use by calling the setMapperByClass() and setReducerByClass() methods. 5. Now we tell the Job on what data it will operate (FileInputFormat.addInputPath(), call once for each file) and where it should store it's output data (FileOutputFormat.setOutputPath()) (note: the output directory should not yet exist!). 6. The Job has all information it needs now, and is being submitted by calling job.waitForCompletion(). org.apache.hadoop.mapreduce.Mapper @See: http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/io/LongWritable.html @See: http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/io/Text.html Our empty Mapper class only provides us the setup() and map() functions. It is worth to note that, using Java generics, we tell the Mapper that the type of: 1. the input key will be LongWritable (an object wrapper for the long datatype) 2. the input value will be Text (an object wrapper for text) 3. the output key will be LongWritable as well 4. the output value will be Text. public class MyMapper extends Mapper<LongWritable, Text, LongWritable, Text> {     /**      * An org.apache.commons.logging.Log object      */     private static final Log LOG = LogFactory.getLog(RunnerTool.class.getName());     /**      * This function is called once during the start of the map phase.      * @param context The job Context      */     @Override     public void setup(Context context) {              }     /**      * This function holds the mapper logic      * @param key The key of the K/V input pair      * @param value The value of the K/V input pair      * @param context The context of the application 2 http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/util/GenericOptionsParser.html#GenericOptions
org.apache.hadoop.mapreduce.Reducer
@See: http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/io/LongWritable.html
@See: http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/io/Text.html

Our empty Reducer, like the Mapper, only provides the setup() and reduce() functions. Also like our Mapper, using Java generics, we tell the Reducer that the type of:
   1. the input key will be LongWritable
   2. the input value will be Text
   3. the output key will be LongWritable as well
   4. the output value will be Text.

public class MyReducer extends Reducer<LongWritable, Text, LongWritable, Text> {

    /**
     * The LOG object
     */
    private static final Log LOG = LogFactory.getLog(MyReducer.class.getName());

    /**
     * This function is called once during the start of the reduce phase.
     * @param context The job Context
     */
    @Override
    public void setup(Context context) {

    }

    /**
     * This function holds the reducer logic.
     * @param key The key of the input K/V pair
     * @param values Values associated with key
     * @param context The context of the application
     */
    @Override
    public void reduce(LongWritable key, Iterable<Text> values, Context context) {

    }
}

A simple try-out: top Wikipedia page views
Courtesy of Edgar Meij (edgar.meij@uva.nl), UvA ILPS, we have access to a sample dataset containing the number of page views per article, per language code, during a single hour. The data is structured as follows:

    [language_code] [article_name] [page_views] [transferred_bytes]
In the example data below, the English-language article about Amsterdam has been viewed 215 times during a certain hour, and these views generated a total of 23312999 bytes (~23 MB) of traffic.

    en Amsterdam 215 23312999

You can download the sample dataset from https://subtrac.sara.nl/oss/svn/hadoop/branches/WikipediaPageCounts/in-dir/.

Data like this could give us an interesting view on the usage of Wikipedia. Say we had this data collected over a period of months or even longer. We would be able to see the 'rise and fall' in popularity of a certain page over time, and perhaps try to find a relation between the evolution of an article and its relative size by looking at the total amount of transferred bytes. But you can start simpler: by extracting the top [N] most viewed pages per language code. You can use the empty MapReduce classes from the previous chapter as a starting point.

The setup
Our Mapper will output the language code as key, and the page views and article title as value – for each line in our input file. Our Reducer – which gets the data after the shuffle phase is done and the intermediate keys have been sorted – will receive all pages associated with a single language code. The Reducer will maintain a top [N] list of the pages it has seen, and output this list once it has checked all values. The implementation of Tool we will use is responsible for reading a single argument: [N]. Furthermore, it needs to tell the Job how to handle our Mapper and Reducer, in particular which InputFormat and OutputFormat to expect, and which outputKeyClass and outputValueClass to use.

Our Tool
@JavaDoc: https://subtrac.sara.nl/oss/svn/hadoop/branches/WikipediaPageCounts/doc/javadoc/nl/sara/hadoop/RunnerTool.html
@See: http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapreduce/lib/input/TextInputFormat.html
@See: http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapreduce/lib/output/TextOutputFormat.html

Below are the steps you need to take to get a functional implementation of Tool for this job. Hint: use the previous chapters if you are missing information, and try to get familiar with the APIs by looking at the documentation. A sketch of one possible result is given after the steps.
   1. Our Tool will accept a single argument, N. It will have to pass the argument on from the main() method to the run() method – keeping missing input in mind, of course. After that, it should use the Configuration.set() method to pass the value on to the job.
   2. Since we are dealing with plain text, organized in single lines, we can use Hadoop's native TextInputFormat to deal with our input file and create FileSplits for our Mapper.
   3. The output will be lines in the form of [language_code] [article_name] [page_views]. We can easily store this as plain text, so we can use Hadoop's TextOutputFormat.
   4. Since both the key and the value we will store in our TextOutputFormat will be of type Text, we should tell our job to expect these types.
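To make the steps above concrete, here is a minimal sketch of what such a run() method could look like, building on the RunnerTool from the previous chapter. This is one possible solution rather than the reference implementation from the Javadoc link above: the configuration key "topn.n" and the default value of 10 are our own choices, and the hard-coded "in-dir" and "out-dir-" paths are simply carried over from the earlier example.

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class RunnerTool extends Configured implements Tool {

    private static final Log LOG = LogFactory.getLog(RunnerTool.class.getName());

    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = getConf();

        // Step 1: read [N] from the command line, keeping missing input in mind.
        int n = 10;
        if (args.length > 0) {
            n = Integer.parseInt(args[0]);
        } else {
            LOG.warn("No [N] given, falling back to the default: " + n);
        }
        // Pass [N] on to the Mapper/Reducer via the job configuration ("topn.n" is our own key).
        conf.setInt("topn.n", n);

        Job job = new Job(conf);
        job.setJarByClass(RunnerTool.class);
        job.setMapperClass(MyMapper.class);
        job.setReducerClass(MyReducer.class);

        // Steps 2 and 3: plain text in, plain text out.
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        // Step 4: both the output key and the output value are of type Text.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path("in-dir"));
        FileOutputFormat.setOutputPath(job, new Path("out-dir-" + System.nanoTime()));

        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String... args) throws Exception {
        System.exit(ToolRunner.run(new RunnerTool(), args));
    }
}

Note that the Job takes a copy of the Configuration when it is constructed, so set the value before creating the Job (or use job.getConfiguration() afterwards); otherwise the Mapper and Reducer will not see it.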
Our Mapper
@JavaDoc: https://subtrac.sara.nl/oss/svn/hadoop/branches/WikipediaPageCounts/doc/javadoc/nl/sara/hadoop/MyMapper.html

Our Mapper is trivially simple. It needs to split the input value (the TextInputFormat gives the line itself as value, and the position of the first character of the line in the file as key) on spaces. If that was successful, it should output the first word – the language code – as key, and the remainder as value. A sketch of a possible implementation is given after the Reducer section below.

Even though this is a trivial action and can be written as a single line of code, make sure to deal with Exceptions. You cannot expect every line in the input to be structured exactly the same, and a fact of life is that most datasets you will work with do not strictly adhere to their documented structure. Fault tolerance can be achieved by wrapping the parsing logic in try/catch blocks, while (especially during development) logging all entries that raise an Exception.

Our Reducer
@JavaDoc: https://subtrac.sara.nl/oss/svn/hadoop/branches/WikipediaPageCounts/doc/javadoc/nl/sara/hadoop/MyReducer.html

The Reducer is a bit less trivial. We want to loop over all values we receive for a certain key (a language code in our case) and maintain a top [N] of most viewed pages. Every time we process a new value, we should check whether its page view count is higher than the lowest entry in our current top [N], and replace that lowest entry with the current one if it is. A TreeMap object comes in handy for storing the top [N], since it keeps its entries sorted by key. This makes finding the current lowest value of your top very easy.
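As announced in the Mapper section above, here is a minimal sketch of what MyMapper could look like for this job. The value encoding "[page_views] [article_name]" is our own convention – the assignment only says to emit the language code as key and the rest of the line as value – so any encoding will do, as long as the Reducer can parse it back. Also note that, compared to the empty MyMapper from the previous chapter, the output key type in the generics has changed from LongWritable to Text, matching the outputKeyClass configured in the Tool.

import java.io.IOException;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MyMapper extends Mapper<LongWritable, Text, Text, Text> {

    private static final Log LOG = LogFactory.getLog(MyMapper.class.getName());

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        try {
            // Each input line looks like: [language_code] [article_name] [page_views] [transferred_bytes]
            String[] fields = value.toString().split(" ");
            String languageCode = fields[0];
            String articleName = fields[1];
            String pageViews = fields[2];

            // Emit the language code as key and "[page_views] [article_name]" as value.
            context.write(new Text(languageCode), new Text(pageViews + " " + articleName));
        } catch (Exception e) {
            // Malformed lines are logged and skipped instead of failing the whole job.
            LOG.warn("Skipping malformed input line: " + value, e);
        }
    }
}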
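Finally, a minimal sketch of a possible MyReducer. It assumes the "[page_views] [article_name]" value format from the Mapper sketch and the "topn.n" configuration key from the Tool sketch, both of which are our own conventions. One simplification to be aware of: because the TreeMap is keyed on the page view count, two articles with exactly the same number of views will overwrite each other – good enough for a first try-out, but not a complete solution.

import java.io.IOException;
import java.util.Map;
import java.util.TreeMap;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MyReducer extends Reducer<Text, Text, Text, Text> {

    private static final Log LOG = LogFactory.getLog(MyReducer.class.getName());

    private int n;

    @Override
    public void setup(Context context) {
        // Read [N] back from the job configuration (10 if it was never set).
        n = context.getConfiguration().getInt("topn.n", 10);
    }

    @Override
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // A TreeMap keeps its entries sorted by key (the page view count here),
        // so the lowest entry of the current top [N] is always topN.firstKey().
        TreeMap<Long, String> topN = new TreeMap<Long, String>();

        for (Text value : values) {
            try {
                String[] fields = value.toString().split(" ", 2);
                long pageViews = Long.parseLong(fields[0]);
                String articleName = fields[1];

                if (topN.size() < n) {
                    topN.put(pageViews, articleName);
                } else if (pageViews > topN.firstKey()) {
                    // This page beats the current lowest entry: replace it.
                    topN.remove(topN.firstKey());
                    topN.put(pageViews, articleName);
                }
            } catch (Exception e) {
                LOG.warn("Skipping malformed value: " + value, e);
            }
        }

        // Emit the top [N], highest first. With TextOutputFormat this produces lines like:
        // en<TAB>Amsterdam 215
        for (Map.Entry<Long, String> entry : topN.descendingMap().entrySet()) {
            context.write(key, new Text(entry.getValue() + " " + entry.getKey()));
        }
    }
}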