Hadoop MapReduce Framework - Module 3
Hadoop Data Types (http://hadoop.apache.org/docs/current/api/index.html)
org.apache.hadoop.io
• int -> IntWritable, long -> LongWritable, boolean -> BooleanWritable, float -> FloatWritable, byte -> ByteWritable
We can use the following built-in data types as keys and values:
• Text: stores UTF-8 text
• BytesWritable: stores a sequence of bytes
• VIntWritable and VLongWritable: store variable-length integer and long values
• NullWritable: a zero-length Writable type, used when you don’t want a key or value
• The key class should implement the WritableComparable interface.
• The value class should implement the Writable interface.
E.g.
public class IntWritable implements WritableComparable
public abstract interface WritableComparable<T> extends Writable, Comparable<T>
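As a concrete illustration of these two contracts (a hypothetical type, not from the slides), a custom key class only has to serialize its state and define a sort order:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// Hypothetical custom key type: a year, usable as a map output key.
public class YearWritable implements WritableComparable<YearWritable> {
  private int year;

  public void set(int year) { this.year = year; }

  @Override
  public void write(DataOutput out) throws IOException {
    out.writeInt(year);              // serialize state
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    year = in.readInt();             // deserialize state
  }

  @Override
  public int compareTo(YearWritable other) {
    return Integer.compare(year, other.year); // sort order used during shuffle & sort
  }
}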
Hadoop Data Types (contd.)
MapReduce paradigm
• Splits input files into blocks (typically 64 MB each)
• Operates on key/value pairs
• Mappers filter & transform input data
• Reducers aggregate mapper output
• Efficient way to process the cluster:
• Move code to data
• Run code on all machines
• Divide & conquer: partition a large problem into smaller sub-problems
• Independent sub-problems can be executed in parallel by workers (anything from threads to clusters)
• Intermediate results from each worker are combined to get the final result
MapReduce paradigm (contd.)
• Challenges:
• How to transform a problem into sub-problems?
• How to assign workers and synchronize the intermediate results?
• How do the workers get the required data?
• How to handle failures in the cluster?
Map and Reduce tasks
Shuffle and Sort
MapReduce Execution Framework
Combiners
• Combiner: local aggregation of key/value pairs after map() and before the shuffle & sort phase (occurs on the same machine as map())
• Also called a “mini-reducer”
• Instead of emitting (the,1) 100 times, the combiner emits (the,100)
• Can lead to great speed-ups and saves network bandwidth
• Each combiner operates in isolation and has no access to other mappers’ key/value pairs
• A combiner cannot be assumed to process all values associated with the same key (it may not run at all; that is Hadoop’s decision)
• Emitted key/value pair types must be the same as those emitted by the mapper
Combiners (contd.)
• If the function computed is
• Commutative [a + b = b + a]
• Associative [a + (b + c) = (a + b) + c]
then the reducer can be reused as the combiner (see the wiring sketch below).
Max works:
max(max(a, b), max(c, d, e)) = max(a, b, c, d, e)
Mean does not work:
mean(mean(a, b), mean(c, d, e)) != mean(a, b, c, d, e)
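A minimal sketch of how this reuse is wired into a job, assuming the WordCountMapper and WordCountReducer classes shown on the following slides:

job.setMapperClass(WordCountMapper.class);
job.setCombinerClass(WordCountReducer.class); // local, per-map aggregation; may run zero or more times
job.setReducerClass(WordCountReducer.class);  // final, cluster-wide aggregation

This is safe precisely because integer addition is commutative and associative; a mean-computing reducer could not be wired in the same way.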
MapReduce Programming: Word Count
WordCountDriver.java
public class WordCountDriver extends Configured implements Tool {
  public int run(String[] args) throws Exception {
    Configuration conf = new Configuration();
    GenericOptionsParser parser = new GenericOptionsParser(conf, args);
    args = parser.getRemainingArgs();
    Job job = new Job(conf, "wordcount");
    job.setJarByClass(WordCountDriver.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    // Local Windows paths; forward slashes work in Hadoop's Path.
    FileInputFormat.setInputPaths(job, new Path("E:/aa/input/names.txt"));
    FileOutputFormat.setOutputPath(job, new Path("E:/aa/output"));
    job.setMapperClass(WordCountMapper.class);
    job.setReducerClass(WordCountReducer.class);
    // Exit-code convention: 0 on success, non-zero on failure.
    return job.waitForCompletion(true) ? 0 : 1;
  }
}
MapReduce Programming: Word Count
WordCountMapper.java
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private Text word = new Text();
  private final static IntWritable one = new IntWritable(1);
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
      word.set(tokenizer.nextToken());
      context.write(word, one);
    }
  }
}
MapReduce Programming: Word Count
WordCountReducer.java
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable value : values) {
      sum += value.get();
    }
    context.write(key, new IntWritable(sum));
  }
}
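To trace the three classes end to end, consider a hypothetical one-line input (not from the slides):

input line:    the cat sat on the mat
map emits:     (the,1) (cat,1) (sat,1) (on,1) (the,1) (mat,1)
shuffle/sort:  (cat,[1]) (mat,[1]) (on,[1]) (sat,[1]) (the,[1,1])
reduce emits:  (cat,1) (mat,1) (on,1) (sat,1) (the,2)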
A minimal MapReduce driver
Every explicit setting below is the value Hadoop would apply by default anyway; the base Mapper and Reducer classes simply pass records through unchanged.
public class MinimalMapReduceWithDefaults extends Configured implements Tool {
  public int run(String[] args) throws Exception {
    Job job = new Job(getConf());
    // Input/output paths still have to be supplied (here: from the command line).
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.setInputFormatClass(TextInputFormat.class);
    job.setMapperClass(Mapper.class);
    job.setMapOutputKeyClass(LongWritable.class);
    job.setMapOutputValueClass(Text.class);
    job.setPartitionerClass(HashPartitioner.class);
    job.setNumReduceTasks(1);
    job.setReducerClass(Reducer.class);
    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(Text.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    return job.waitForCompletion(true) ? 0 : 1;
  }
  public static void main(String[] args) throws Exception {
    int exitCode = ToolRunner.run(new MinimalMapReduceWithDefaults(), args);
    System.exit(exitCode);
  }
}
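A typical invocation (jar and path names hypothetical):

hadoop jar minimal.jar MinimalMapReduceWithDefaults input/sample.txt output

With the identity Mapper and Reducer and the default text formats, each output line is the byte offset of an input line, a tab, and the line itself.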
Input Splits and Records
• An input split is a chunk of the input that is processed by a single map.
• Each map processes a single split.
• Each split is divided into records, and the map processes each record (a key-value pair) in turn.
public abstract class InputSplit {
  public abstract long getLength() throws IOException, InterruptedException;
  public abstract String[] getLocations() throws IOException, InterruptedException;
}
InputFormat
• An InputFormat is responsible for creating the input splits and dividing them into records.
public abstract class InputFormat<K, V> {
  public abstract List<InputSplit> getSplits(JobContext context)
      throws IOException, InterruptedException;
  public abstract RecordReader<K, V> createRecordReader(InputSplit split, TaskAttemptContext context)
      throws IOException, InterruptedException;
}
InputFormat class hierarchy
FileInputFormat
• A place to define which files are included as the input to a job.
• An implementation for generating splits for the input files.
FileInputFormat input paths
public static void addInputPath(Job job, Path path)
public static void setInputPaths(Job job, Path... inputPaths)
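Both can be used from a driver; a short sketch with hypothetical paths:

FileInputFormat.addInputPath(job, new Path("/data/2013"));   // add paths one at a time...
FileInputFormat.addInputPath(job, new Path("/data/2014"));
FileInputFormat.setInputPaths(job, new Path("/data/2013"), new Path("/data/2014")); // ...or set them all at once, replacing any paths set earlier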
FileInputFormat input splits
splitSize = max(minimumSize, min(maximumSize, blockSize))
By default minimumSize < blockSize < maximumSize, so the split size is the block size.
How to control the split size? (see the sketch below)
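One way, sketched with the FileInputFormat helpers from the mapreduce API (sizes illustrative; in Hadoop 2.x these back the mapreduce.input.fileinputformat.split.minsize/maxsize properties):

// Larger splits: raise the minimum above the block size.
FileInputFormat.setMinInputSplitSize(job, 256 * 1024 * 1024L);
// Smaller splits: lower the maximum below the block size (use one or the other).
FileInputFormat.setMaxInputSplitSize(job, 32 * 1024 * 1024L);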
Text Input: TextInputFormat
• TextInputFormat is the default InputFormat.
• Each record is a line of input.
• The key, a LongWritable, is the byte offset within the file of the beginning of the line.
• The value is the contents of the line, excluding any line terminators (newline, carriage return), packaged as a Text object.
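For example, for a hypothetical two-line file:

the quick brown fox
jumped over the lazy dog

TextInputFormat produces two records; the first line is 19 characters plus a newline, so the second line starts at byte offset 20:

(0,  the quick brown fox)
(20, jumped over the lazy dog)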
Binary Input: SequenceFileInputFormat
• Hadoop’s sequence file format stores sequences of binary key-value pairs.
• Sequence files are well suited as a format for MapReduce data since they are splittable.
• Compression is supported as part of the format.
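A sketch of a job wired for sequence-file input and block-compressed sequence-file output (classes from org.apache.hadoop.mapreduce.lib and org.apache.hadoop.io):

job.setInputFormatClass(SequenceFileInputFormat.class);   // keys/values are read as the types stored in the file
job.setOutputFormatClass(SequenceFileOutputFormat.class);
SequenceFileOutputFormat.setCompressOutput(job, true);
SequenceFileOutputFormat.setOutputCompressionType(job, SequenceFile.CompressionType.BLOCK);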
Multiple Inputs
MultipleInputs lets each input path be parsed by its own InputFormat and processed by its own Mapper:
MultipleInputs.addInputPath(job, ABCInputPath, TextInputFormat.class, MapperABC.class);
MultipleInputs.addInputPath(job, XYZInputPath, TextInputFormat.class, MapperXYZ.class);
Output Formats Class Hierarchy
Output Types
• Text Output
• The default output format, TextOutputFormat, writes records as lines of text.
• Binary Output
• SequenceFileOutputFormat writes sequence files for its output.
• Multiple Outputs
• MultipleOutputs allows you to write data to files whose names are derived from the output keys and values, or in fact from an arbitrary string (see the sketch below).
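A minimal sketch of MultipleOutputs inside a reducer (class and field names hypothetical; assumes the keys are safe to use as file names):

public class PartitionReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  private MultipleOutputs<Text, IntWritable> mos;

  @Override
  protected void setup(Context context) {
    mos = new MultipleOutputs<>(context);
  }

  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable v : values) sum += v.get();
    mos.write(key, new IntWritable(sum), key.toString()); // output file name derived from the key
  }

  @Override
  protected void cleanup(Context context) throws IOException, InterruptedException {
    mos.close(); // flush and close every output file
  }
}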