MapReduce allows distributed processing of large datasets across clusters of computers. It works by splitting the input data into independent chunks, which are processed in parallel by the map function. The map function produces intermediate key-value pairs, which are grouped by key and passed to the reduce function to form the output data. Fault tolerance is achieved through replication of data across nodes and re-execution of failed tasks. This makes MapReduce suitable for efficiently processing very large datasets in a distributed environment.
Big data analytics with Apache Hadoop
1. BIG DATA ANALYTICS WITH APACHE HADOOP
“Big Data: A Revolution that Will Transform How We Live, Work, and Think”
-Viktor Mayer-Schönberger and Kenneth Cukier
2. Team Members
Abhishek Kumar : Y11UC010
Sachin Mittal : Y11UC189
Subodh Rawani : Y11UC230
Suman Saurabh : Y11UC231
3. Contents
1. What is Big Data?
Definition
Turning Data into Value: The 5 V’s
2. Big Data Analytics
3. Big Data and Hadoop
History of Hadoop
About Apache Hadoop
Key Features of Hadoop
4. Hadoop and MapReduce
About MapReduce
MapReduce Architecture
MapReduce Functionality
MapReduce Examples
5. Definition
“Data is the oil of the 21st century, and analytics is the combustion engine”
-Peter Sondergaard, Senior Vice President, Gartner Research
“Big Data is high-volume, high-velocity and high-variety information assets that require new
forms of processing to enable enhanced decision making, insight discovery and process
optimisation.”
“It is a subjective term; what it involves is analysis of data from multiple sources, joined and
aggregated in arbitrary ways, enabling deeper analyses than any one system can provide.”
-Tom White in Hadoop: The Definitive Guide
Big Data is fuelled by two things:
• The increasing ‘datafication’ of the world, which generates new data at frightening rates.
• Technological advances that make it possible to harness such large and complex data and
perform analysis using improved techniques.
6. Big data describes the exponential growth and availability of data, both structured and unstructured. This data
comes from everywhere: climate sensors, social media posts, digital files, buy/sell transaction records, cell phone
GPS signals and more.
7. Statistics of Data Generated
Big Data in Today’s Business and Technology
Environment
235 terabytes of data had been collected by the
U.S. Library of Congress as of April 2011. (Source)
Facebook stores, accesses, and analyzes 30+
Petabytes of user generated data. (Source)
Walmart handles more than 1 million customer
transactions every hour, which is imported into
databases estimated to contain more than 2.5
petabytes of data. (Source)
More than 5 billion people are calling, texting,
tweeting and browsing on mobile phones
worldwide. (Source)
In 2008, Google was processing 20,000 terabytes
of data (20 petabytes) a day. (Source)
The Rapid Growth of Unstructured Data
YouTube users upload 48 hours of new video
every minute of the day. (Source)
Brands and organizations on Facebook receive
34,722 Likes every minute of the day. (Source)
Twitter sees roughly 175 million tweets every day
and has more than 465 million accounts. (Source)
In late 2011, IDC Digital Universe published a
report indicating that some 1.8 zettabytes of data
would be created that year. (Source)
In other words, the amount of data in the world
today is equal to:
Every person in the world having more than 215m high-
resolution MRI scans a day.
More than 200bn HD movies – which would take a person
47m years to watch.
9. Turning Big Data into Value: The 5 V’s
The digital era produces unprecedented
amounts of data in terms of Volume,
Velocity, Variety and Veracity; properly
channelling them yields Value.
Volume: Refers to the terabytes, petabytes and even
zettabytes of data generated every second.
Velocity: The speed at which new data is generated
every second. E.g. Google, Twitter, Facebook.
Variety: Data in different formats, such as text, images,
video and so on, can be stored and processed,
rather than only relational data.
Veracity: The trustworthiness of the data. E.g. Twitter
data with hashtags, abbreviations, typos and
colloquial speech, as well as the reliability and
accuracy of content. Even data that is not fully
reliable can still be processed.
Value: Having access to big data is no good unless
we can turn it into value.
11. Some Big Data Use Case By Industry
Telecommunications
Network analytics
Location-based services
Retail
Merchandise optimization
Supply-Chain Management
Banking
Fraud Detection
Trade Surveillance
Media
Click-Fraud Prevention
Social Graph Analysis
Energy
Smart Meter Analytics
Distribution load forecasting
Manufacturing
Customer Care Call Centers
Customer Relationship
Public
Threat Detection
Cyber Security
Healthcare
Clinical Trials Data Analysis
Supply Chain Management
Insurance
Catastrophe Modelling
Claims Fraud
13. Challenges of big data
How to store and protect Big data?
How to organize and catalog the data that you have backed up?
How to keep costs low while ensuring that all the critical data is
available when you need it?
Analytical Challenges
Human Resources and Manpower
Technical Challenges
Privacy and Security
15. Why Big-Data Analytics?
• Understand existing data resources.
• Process them and uncover patterns,
correlations and other useful
information that can be used to make
better decisions.
• With big data analytics, data scientists
and others can analyse huge volumes
of data that conventional analytics and
business intelligence solutions can't
touch.
16. Traditional vs. Big Data Approaches
Traditional Approach: Structured & Repeatable Analysis
Business users determine what question to ask;
IT structures the data to answer that question.
Examples: monthly sales reports, profitability analysis, customer surveys.
Big Data Approach: Iterative & Exploratory Analysis
IT delivers a platform to enable creative discovery;
business explores what questions could be asked.
Examples: brand sentiment, product strategy, maximum asset utilization.
21. Brief history of Hadoop
Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used
text search library. Hadoop has its origins in Apache Nutch, an open source web
search engine, itself a part of the Lucene project.
Nutch was started in 2002, and a working crawler and search system quickly emerged.
However, their architecture wouldn’t scale to the billions of pages on the Web. In 2003
Google published a paper on the Google File System (GFS), which was being used in
production at Google. In 2004 the Nutch developers implemented the Nutch Distributed
Filesystem (NDFS), following the GFS architecture, to solve their storage needs for the
very large files generated as part of the web crawl and indexing process.
Also in 2004, Google published the paper that introduced MapReduce to the world. NDFS
and the MapReduce implementation in Nutch were applicable beyond the realm of
search, and in February 2006 they moved out of Nutch to form an independent
subproject of Lucene called Hadoop.
22. Apache Hadoop
Framework for the distributed
processing of large data sets across
clusters of computers using simple
programming models.
Designed to scale up from a single
server to thousands of machines, with
a very high degree of fault tolerance.
Rather than relying on high-end
hardware, the resiliency of these
clusters comes from the software’s
ability to detect and handle failures at
the application layer.
23. Key Features of Hadoop
1. Flexible
2. Scalable
3. Building a more efficient data
economy
4. Cost Effective
5. Fault Tolerant
24. 1) Flexible
1. Hadoop is schema-less, and can absorb any type of data,
structured or not, from any number of sources.
2. Data from multiple sources can be joined and aggregated in arbitrary
ways enabling deeper analyses than any one system can provide.
3. We can develop MapReduce programs on Linux, Windows or OS X in
languages such as Python, R, C++, Perl, Ruby, etc.
25. 2) Scalable
Scalability is one of the primary forces driving popularity and adoption
of the Apache Hadoop project. A typical use case for Hadoop is an
emerging web site starting out with a five-node cluster. New nodes can be
added as needed, without needing to change data formats,
how data is loaded, how jobs are written, or the applications on top.
1. Yahoo reportedly ran numerous clusters having 4000+ nodes with
four 1 TB drives per node, 15 PB of total storage capacity.
2. Facebook’s 2000-node warehouse cluster is provisioned for 21 PB of
total storage capacity. Extrapolating the announced growth rate, its
namespace should have close to 200 million objects by now.
3. eBay runs a 700-node cluster. Each node has 24 TB of local disk
storage, 72 GB of RAM, and a 12-core CPU. Total cluster size is 16
PB. It is configured to run 26,000 MapReduce tasks simultaneously.
26. 3) Building a more efficient data economy
Data is the new currency of the modern world. Businesses that
successfully maximize its value will have a decisive impact on their own
value and on their customers’ success.
Apache Hadoop allows businesses to create highly scalable and cost-
efficient data stores. It offers data value at unprecedented scale.
27. 4) Cost Effective
Hadoop brings massively parallel computing to commodity servers. The
result is a sizeable decrease in the cost per terabyte of storage, which
in turn makes it affordable to model all your data.
It's a cost-effective alternative to a conventional extract, transform, and
load (ETL) process that extracts data from different systems, converts it
into a structure suitable for analysis and reporting, and loads it into a
database.
28. 5) Fault tolerant
When you lose a node, the system redirects work to another location of
the data and continues processing without missing a beat.
When a node becomes non-functional, a nearby node (a supernode that is
near completion or has already completed its own task) reassigns itself to
the task of the faulty node, whose description is kept in shared memory.
A faulty node therefore does not have to wait for the master node to
notice its failure, which reduces execution time whenever a node
becomes faulty.
30. HDFS Architecture
HDFS is a filesystem designed for storing
very large files with streaming data access
patterns, running on clusters of commodity
hardware. HDFS clusters consist of a
NameNode that manages the file system
metadata and DataNodes that store the
actual data.
Uses:
• Storage of large imported files from
applications outside of the Hadoop
ecosystem.
• Staging of imported files to be
processed by Hadoop applications.
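To make the NameNode/DataNode interaction concrete, here is a minimal, illustrative Java sketch that writes a small file into HDFS and reads it back through the FileSystem API. The NameNode address (hdfs://namenode:8020) and the file path are assumed placeholders, not values from the slides.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
public class HdfsExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Assumed NameNode address; in a real cluster this usually comes from core-site.xml
    conf.set("fs.defaultFS", "hdfs://namenode:8020");
    FileSystem fs = FileSystem.get(conf);
    // Write a small file: the NameNode records the metadata, DataNodes hold the blocks
    Path path = new Path("/staging/hello.txt");
    try (FSDataOutputStream out = fs.create(path, true)) {
      out.writeBytes("Hello HDFS\n");
    }
    // Read the file back
    try (BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(path)))) {
      System.out.println(in.readLine());
    }
  }
}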
31.
• Hive: Hive bridges the gap between SQL-based RDBMSs and NoSQL-based
Hadoop. Datasets from HDFS and HBase can be mapped onto Hive, and
queries against them can be written in an SQL-like language called HiveQL
(a brief JDBC/HiveQL sketch follows this slide). Though Hive is not a
panacea for complex operations, it spares a programmer who knows SQL
from having to write MapReduce jobs.
• HBase: Inspired by Google’s BigTable, HBase is a NoSQL distributed column-
oriented database that runs on top of HDFS and supports random reads and
writes. HBase enables you to store and retrieve random data in near real time.
It can also be combined with MapReduce to ease bulk operations such as
indexing or analysis.
• Pig: Apache Pig uses the data flow language Pig Latin. Pig supports relational
operations such as join, group and aggregate, and it can be scaled across
multiple servers simultaneously. Time-intensive ETL operations, analytics on
sample data, and complex tasks that collate multiple data sources are some
of the use cases that can be handled using Pig.
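As an illustration of HiveQL, the sketch below runs a query from Java through the standard HiveServer2 JDBC driver. The host name, database, table name and credentials are placeholders, and the hive-jdbc library is assumed to be on the classpath.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
public class HiveQueryExample {
  public static void main(String[] args) throws Exception {
    // Placeholder connection string for a HiveServer2 instance
    try (Connection conn = DriverManager.getConnection(
             "jdbc:hive2://hive-server:10000/default", "hadoop", "");
         Statement stmt = conn.createStatement();
         // HiveQL: an SQL-like query that Hive compiles into MapReduce jobs
         ResultSet rs = stmt.executeQuery(
             "SELECT word, COUNT(*) AS cnt FROM words GROUP BY word ORDER BY cnt DESC LIMIT 10")) {
      while (rs.next()) {
        System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
      }
    }
  }
}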
32.
• Flume: Flume is a distributed system that aggregates streaming data from
different sources and adds it to a centralized datastore for a Hadoop cluster,
such as HDFS. Flume facilitates data aggregation, importing and processing
data for computation in HDFS or for storage in databases.
• Sqoop: Sqoop is the latest Hadoop framework to be enlisted in the Bossie
awards for open-source big data tools. Sqoop enables two-way import/export
of bulk data between HDFS/Hive/HBase and relational or structured
databases. Unlike Flume, Sqoop handles the transfer of structured datasets.
• Mahout: Mahout is a suite of scalable machine learning libraries implemented
on top of MapReduce. Commercial use cases of machine learning include
predictive analysis via collaborative filtering, clustering and classification.
Product/service recommendations, investigative data mining and statistical
analysis are some of its generic use cases.
34. MapReduce
MapReduce is a programming paradigm for easily writing applications which process
vast amounts of data (multi-terabyte data-sets) in-parallel on large clusters
(thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.
The framework is divided into two parts:
Map, which parcels out work to different nodes in the distributed cluster.
Reduce, which collates the work and resolves the results into a single value.
The MapReduce framework consists of a single master JobTracker and one
slave TaskTracker per cluster node. The master is responsible for scheduling the jobs'
component tasks on the slaves, monitoring them and re-executing the failed tasks.
Although the Hadoop framework is implemented in Java, MapReduce applications
can be written in Python, Ruby, R or C++, e.g. via Hadoop Streaming and Hadoop Pipes.
36. MapReduce core functionality (I)
Data flow beyond the two key pieces (map and reduce); a plain-Java sketch of this flow follows the list:
• Input reader – divides input into appropriate size splits which get
assigned to a Map function.
• Map function – maps file data to smaller, intermediate <key, value>
pairs.
• Compare function – input for Reduce is pulled from the Map
intermediate output and sorted according to the compare function.
• Reduce function – takes intermediate values and reduces to a
smaller solution handed back to the framework.
• Output writer – writes file output
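To make this data flow concrete before the Hadoop code that follows, here is a minimal, framework-free Java sketch that traces the deck’s word-count data through the map, shuffle/sort and reduce stages. Class and variable names are purely illustrative.
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;
public class DataFlowSketch {
  public static void main(String[] args) {
    // Two input "splits", one per (simulated) map task
    List<String> splits = Arrays.asList("Hello World Bye World",
                                        "Hello Hadoop Goodbye Hadoop");
    // Map: each split becomes intermediate <word, 1> pairs
    List<Map.Entry<String, Integer>> intermediate = new ArrayList<>();
    for (String split : splits) {
      for (String word : split.split("\\s+")) {
        intermediate.add(new AbstractMap.SimpleEntry<>(word, 1));
      }
    }
    // Shuffle/sort: group the intermediate values by key, sorted by key
    SortedMap<String, List<Integer>> grouped = new TreeMap<>();
    for (Map.Entry<String, Integer> pair : intermediate) {
      grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
    }
    // Reduce: sum the list of values for each key and write the output
    for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
      int sum = 0;
      for (int v : entry.getValue()) sum += v;
      System.out.println("<" + entry.getKey() + ", " + sum + ">");  // e.g. <Hadoop, 2>
    }
  }
}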
37. How MapReduce Works
User to-do list:
• Indicate input/output files, M (number of map tasks), R (number of reduce tasks)
and W (number of machines)
• Write the map and reduce functions
• Submit the job
What the framework does:
• Input files are split into M pieces on the distributed file system
(typically ~64 MB blocks)
• Intermediate files created by map tasks are written to local disk
• Sorted and shuffled output is sent to the reduce tasks
(a combiner is also used in most cases)
• Output files are written to the distributed file system
39. MapReduce Examples
1. WordCount (reads a text file and counts how often each word occurs).
2. TopN (finds the top-N most-used words of a text file).
40. 1. WordCount
Reads text files and counts how often each word occurs.
The input and the output are text files.
Need three classes:
• WordCount.java: Driver class with main function
• WordMapper.java: Mapper class with map method
• SumReducer.java: Reducer class with reduce method
42. WordCount Example (Contd.)
WordMapper.java
Mapper class with map function
For the given sample input, assuming two map nodes,
the sample input is distributed to the maps.
The first map emits:
<Hello, 1> <World, 1> <Bye, 1> <World, 1>
The second map emits:
<Hello, 1> <Hadoop, 1> <Goodbye, 1> <Hadoop, 1>
43. WordCount Example (Contd.)
SumReducer.java
Reducer class with reduce function
For the input from two Mappers
the reduce method just sums up the values,
which are the occurrence counts for each key
Thus the output of the job is:
<Bye, 1> <Goodbye, 1> <Hadoop, 2> <Hello, 2> <World, 2>
45. WordCount (Driver)
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
public class WordCount {
  public static void main(String[] args) throws Exception {
    // Check the input and output file arguments
    if (args.length != 2) {
      System.out.println("usage: [input] [output]");
      System.exit(-1);
    }
    Job job = Job.getInstance(new Configuration());
    // Set the output (key, value) types
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    // Set the Mapper/Reducer classes
    job.setMapperClass(WordMapper.class);
    job.setReducerClass(SumReducer.class);
    // Set the input/output format classes
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    // Set the input/output paths
    FileInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    // Set the driver class
    job.setJarByClass(WordCount.class);
    // Submit the job to the master node
    job.submit();
  }
}
46. Set output (key, value) types
47. Set Mapper/Reducer classes
48. Set Input/Output format classes
49. Set Input/Output paths
50. Set Driver class
51. Submit the job to the master node
52. WordMapper (Mapper class)
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
// Extends Mapper with input (key, value) types <Object, Text>
// and output (key, value) types <Text, IntWritable>
public class WordMapper extends Mapper<Object, Text, Text, IntWritable> {
  private Text word = new Text();
  private final static IntWritable one = new IntWritable(1);
  @Override
  public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
    // Break the line into words for processing
    StringTokenizer wordList = new StringTokenizer(value.toString());
    while (wordList.hasMoreTokens()) {
      // Emit <word, 1> for each word through the Context
      word.set(wordList.nextToken());
      context.write(word, one);
    }
  }
}
53. Extends the Mapper class with input/output keys and values
54. Output (key, value) types
55. Input (key, value) types; output as Context type
56. Read words from each line of the input file
57. Count each word
58. Shuffler/Sorter
Maps emit (key, value) pairs.
The shuffler/sorter of the Hadoop framework sorts the (key, value) pairs by key,
then groups the values to make (key, list of values) pairs.
For example, the first and second maps emit:
<Hello, 1> <World, 1> <Bye, 1> <World, 1>
<Hello, 1> <Hadoop, 1> <Goodbye, 1> <Hadoop, 1>
The shuffler then produces the following, which becomes the input of the reducer:
<Bye, 1>, <Goodbye, 1>, <Hadoop, <1,1>>, <Hello, <1,1>>, <World, <1,1>>
59. SumReducer (Reducer class)
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
// Extends Reducer with input (key, value) types <Text, IntWritable>
// and output (key, value) types <Text, IntWritable>
public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  private IntWritable totalWordCount = new IntWritable();
  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
    // For each word, sum the number of values (its occurrence counts)
    int wordCount = 0;
    Iterator<IntWritable> it = values.iterator();
    while (it.hasNext()) {
      wordCount += it.next().get();
    }
    // The total count becomes the output value for the word
    totalWordCount.set(wordCount);
    context.write(key, totalWordCount);
  }
}
60. Extends the Reducer class with input/output keys and values
61. Set output value type
62. Set input (key, list of values) type and output as Context class
63. For each word, count/sum the number of values
64. For each word, the total count becomes the value
65. Reducer
Input: the shuffler output, which becomes the input of the reducer:
<Bye, 1>, <Goodbye, 1>, <Hadoop, <1,1>>, <Hello, <1,1>>, <World, <1,1>>
Output
<Bye, 1>, <Goodbye, 1>, <Hadoop, 2>, <Hello, 2>, <World, 2>
SumReducer
66. Map() and Reduce()
Map()
The Mapper implementation, via the map method, processes one line at a
time, as provided by the specified TextInputFormat. It then splits the line
into tokens separated by whitespace, via the StringTokenizer, and emits a
key-value pair of <<word>, 1>.
For a sample input, the first map emits:
< Deer, 1> < Beer, 1> < River, 1>
The second map emits:
< Car, 1> < River, 1> < Car, 1>
After local combining of each map’s output,
the output of the first map is:
< Deer, 1> < Beer, 1> < River, 1>
and the output of the second map is:
< Car, 2> < River, 1>
67. Map() and Reduce() (Continued)
Reducer()
The Reducer implementation, via the reduce method, just sums up the
values, which are the occurrence counts for each key (i.e. the words in this
example).
68. 2. TopN
We want to find the top-N most-used words of a text file: “Flatland” by E. Abbott.
The input and the output are text files.
Need three classes
TopN.java
Driver class with main function
TopNMapper.java
Mapper class with map method
TopNReducer.java
Reducer class with reduce method
71. TopNMapper
/**
 * The mapper reads one line at a time, splits it into an array of single words and emits every
 * word to the reducers with the value of 1.
 */
public static class TopNMapper extends Mapper<Object, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();
  // Regex character class of punctuation to strip from the input before tokenizing
  private String tokens = "[_|$#<>^=\\[\\]*/,;.:()?!\"'-]";
  @Override
  public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
    String cleanLine = value.toString().toLowerCase().replaceAll(tokens, " ");
    StringTokenizer itr = new StringTokenizer(cleanLine);
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken().trim());
      context.write(word, one);
    }
  }
}
72. TopNReducer
/**
* The reducer retrieves every word and puts it into a Map: if the word already exists in the
* map, increments its value, otherwise sets it to 1.
*/
public static class TopNReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
private Map<Text, IntWritable> countMap = new HashMap<>();
@Override
public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException,
InterruptedException {
//computes the number of occurrences of a single word
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
// puts the number of occurrences of this word into the map.
// We need to create another Text object because the Text instance
// we receive is the same for all the words
countMap.put(new Text(key), new IntWritable(sum));
}
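The slide stops after reduce(), so it does not show how the top-N entries collected in countMap are finally written out. A plausible, hypothetical completion (not taken from the slides) is a cleanup() override that runs once after all reduce() calls; the value of N and the use of java.util.List/ArrayList are assumptions.
// Hypothetical completion of TopNReducer: emit the N most frequent words once all
// reduce() calls have finished. Assumes java.util.ArrayList and java.util.List are imported.
@Override
protected void cleanup(Context context) throws IOException, InterruptedException {
  int n = 20;  // assumed N; could instead be read from the job Configuration
  List<Map.Entry<Text, IntWritable>> entries = new ArrayList<>(countMap.entrySet());
  // Sort by count, descending
  entries.sort((a, b) -> b.getValue().get() - a.getValue().get());
  for (int i = 0; i < Math.min(n, entries.size()); i++) {
    context.write(entries.get(i).getKey(), entries.get(i).getValue());
  }
}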
74. TopN Results
The 2286
Of 1634
And 1098
That 499
You 429
Not 317
But 279
For 267
By 317
In the shuffle and sort phase, the partitioner will send
every single word (the key) with the value “1” to
the reducers.
All these network transmissions can be
minimized if we locally reduce the data
that the mapper emits.
This is achieved with a combiner.
75. TopNCombiner
/**
 * The combiner locally sums, for each word, the counts emitted by a single mapper
 * and emits the partial sum, so less data has to be shuffled to the reducers.
 */
public static class TopNCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {
@Override
public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException,
InterruptedException {
// computes the number of occurrences of a single word
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
context.write(key, new IntWritable(sum));
}
}
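The TopN.java driver listed on slide 68 is not shown. A hypothetical sketch, modelled on the WordCount driver, would wire the three classes together and register the combiner via job.setCombinerClass(); the single reduce task is an assumption made so that one reducer sees all the word counts.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class TopN {
  // TopNMapper, TopNCombiner and TopNReducer (slides 71, 72 and 75) are
  // assumed to be declared as static nested classes of TopN.
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration());
    job.setJarByClass(TopN.class);
    job.setMapperClass(TopNMapper.class);
    job.setCombinerClass(TopNCombiner.class);  // pre-aggregate map output locally
    job.setReducerClass(TopNReducer.class);
    job.setNumReduceTasks(1);                  // one reducer sees all the counts
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}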
76. Hadoop Output: With and Without Combiner
Counter                  Without Combiner    With Combiner
Map input records        4239                4239
Map output records       37817               37817
Map output bytes         359621              359621
Input split bytes        118                 116
Combine input records    0                   37817
Combine output records   0                   20
Reduce input groups      4987                20
Reduce shuffle bytes     435261              194
Reduce input records     37817               20
Reduce output records    20                  20
77. Advantages and Disadvantages of Using a Combiner
Advantages ->
Network transmissions are minimized.
Disadvantages ->
Hadoop doesn’t guarantee the execution of a combiner: it can be
executed 0, 1 or multiple times on the same input.
Key-value pairs emitted from the mapper are stored in the local file
system, and execution of the combiner can cause extensive I/O
operations.