What to expect
● History.
● What is Hadoop.
● Hadoop vs SQL.
● MapReduce.
● Hadoop Building Blocks.
● Installing, Configuring and Running Hadoop.
● Anatomy of MapReduce program.
Hadoop Series Resources
How was Hadoop born?
Doug Cutting
Challenges of Distributed Processing of
Large Data
● How to distribute the work?
● How to store and distribute the data itself?
● How to overcome failures?
● How to balance the load?
● How to deal with unstructured data?
● ...
Hadoop tackles these
challenges!
So, what’s Hadoop?
What is Hadoop?
Hadoop is an open source framework for writing and
running distributed applications that process large
amounts of data.
Key distinctions of Hadoop:
● Accessible
● Robust
● Scalable
● Simple
Hadoop vs SQL
● Structured and Unstructured data.
● Datastore and Data Analysis.
● Scale-out and Scale-up.
● Offline batch processing and Online
transactions.
Hadoop Uses
MapReduce
What is MapReduce?
● Parallel programming model for clusters of
commodity machines.
● MapReduce provides:
o Automatic parallelization & distribution.
o Fault tolerance.
o Locality of data.
MapReduce … Map then Reduce
Keys and Values
● Key/Value pairs.
● Keys divide Reduce Space.
         Input             Output
Map      <k1, v1>          list(<k2, v2>)
Reduce   <k2, list(v2)>    list(<k3, v3>)
WordCount in Action
Input:
foo.txt: “This is the foo file”
bar.txt: “And this is the bar one”
Map output:
foo.txt: <this, 1>, <is, 1>, <the, 1>, <foo, 1>, <file, 1>
bar.txt: <and, 1>, <this, 1>, <is, 1>, <the, 1>, <bar, 1>, <one, 1>
Reduce#1: Input: this, [1, 1]   Output: this, 2
Reduce#2: Input: is, [1, 1]     Output: is, 2
Reduce#3: Input: foo, [1]       Output: foo, 1
...
Final output:
this 2
is 2
the 2
foo 1
file 1
and 1
bar 1
one 1
WordCount with MapReduce
map(String filename, String document) {
List<String> T = tokenize(document);
for each token in T {
emit ((String)token,
(Integer) 1);
}
}
reduce(String token, List<Integer> values) {
Integer sum = 0;
for each value in values {
sum = sum + value;
}
emit ((String)token, (Integer) sum);
}
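The pseudo-code above can be tried out without a cluster. The following is a minimal plain-Java sketch, not Hadoop code: the class name WordCountSketch and the run driver are hypothetical, and the in-memory grouping step only stands in for the shuffle that the framework would perform. It also assumes that tokens are lowercased and split on whitespace, which matches the final output shown in "WordCount in Action".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java simulation of the WordCount pseudo-code; no Hadoop classes are used.
class WordCountSketch {

    // map: split one document into tokens and emit a (token, 1) pair for each.
    static List<Map.Entry<String, Integer>> map(String filename, String document) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String token : document.toLowerCase().split("\\s+")) {
            if (!token.isEmpty()) {
                out.add(Map.entry(token, 1));
            }
        }
        return out;
    }

    // reduce: sum all counts collected for one token.
    static int reduce(String token, List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    // Driver standing in for the framework: map every document, group the
    // emitted values by key (the "shuffle"), then reduce each group.
    static Map<String, Integer> run(Map<String, String> documents) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, String> doc : documents.entrySet()) {
            for (Map.Entry<String, Integer> pair : map(doc.getKey(), doc.getValue())) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
            counts.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, String> docs = new TreeMap<>();
        docs.put("foo.txt", "This is the foo file");
        docs.put("bar.txt", "And this is the bar one");
        // prints {and=1, bar=1, file=1, foo=1, is=2, one=1, the=2, this=2}
        System.out.println(run(docs));
    }
}
```

The map and reduce methods never see each other; only the driver connects them, which is the property that lets a real framework run them on different machines.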
Hadoop Building Blocks
How does Hadoop work?
Hadoop Building Blocks
1. NameNode
2. DataNode
3. Secondary NameNode
4. JobTracker
5. TaskTracker
HDFS: NameNode and DataNodes
JobTracker and TaskTracker
Typical Hadoop Cluster
Running Hadoop
Three modes to run Hadoop:
1. Local (standalone) mode.
2. Pseudo-distributed mode (“cluster of one”).
3. Fully distributed mode.
In Action
Running Hadoop on a Local Machine
Actions ...
1. Installing Hadoop.
2. Configuring Hadoop (Pseudo-distributed mode).
3. Running WordCount example.
4. Web-based cluster UI.
HDFS
1. HDFS is a filesystem designed for large-scale
distributed data processing.
2. HDFS isn’t a native Unix filesystem.
Basic File Commands:
$ hadoop fs -cmd <args>
$ hadoop fs -ls
$ hadoop fs -mkdir /user/chuck
$ hadoop fs -copyFromLocal
Anatomy of a MapReduce program
MapReduce and beyond
Hadoop
1. Data Types
2. Mapper
3. Reducer
4. Partitioner
5. Combiner
6. Reading and Writing
a. InputFormat
b. OutputFormat
Anatomy of a MapReduce program
Hadoop Data Types
● Hadoop defines its own way of serializing key/value pairs.
● Values should implement the Writable interface.
● Keys should implement the WritableComparable interface.
● Some predefined classes:
o BooleanWritable.
o ByteWritable.
o IntWritable
o ...
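To make the serialization contract concrete, here is a minimal sketch in plain Java. The interfaces SketchWritable and SketchWritableComparable are local stand-ins declared only so the example compiles without Hadoop; they mirror the write(DataOutput) and readFields(DataInput) methods of Hadoop's Writable. The key type WordLengthKey and the roundTrip helper are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical key type: a word together with its length.
class WordLengthKey implements SketchWritableComparable<WordLengthKey> {
    String word = "";
    int length;

    // No-arg constructor so an empty instance can be filled by readFields.
    WordLengthKey() {}

    WordLengthKey(String word) {
        this.word = word;
        this.length = word.length();
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(word);
        out.writeInt(length);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        // Fields must be read in the same order they were written.
        word = in.readUTF();
        length = in.readInt();
    }

    @Override
    public int compareTo(WordLengthKey other) {
        // Order by length first, then alphabetically.
        int byLength = Integer.compare(length, other.length);
        return byLength != 0 ? byLength : word.compareTo(other.word);
    }

    // Serialize a key to bytes and read it back into a fresh instance.
    static WordLengthKey roundTrip(WordLengthKey key) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            key.write(new DataOutputStream(bytes));
            WordLengthKey copy = new WordLengthKey();
            copy.readFields(new DataInputStream(
                    new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        WordLengthKey copy = roundTrip(new WordLengthKey("hadoop"));
        System.out.println(copy.word + " " + copy.length); // prints "hadoop 6"
    }
}

// Local stand-in for the Writable contract (values).
interface SketchWritable {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}

// Keys additionally need an ordering so they can be sorted before the reduce phase.
interface SketchWritableComparable<T> extends SketchWritable, Comparable<T> {}
```

Values only need the two serialization methods; keys also need compareTo because they are sorted on their way to the reducers.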
Mapper
Mapper
1. Extends Mapper<K1,V1,K2,V2>
2. Overrides method:
void map(K1 key, V1 value, Context context)
3. Uses context.write(K2, V2) to emit key/value pairs.
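WordCount Mapper
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
// Nested inside the job's driver class.
public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Emit <token, 1> for every whitespace-separated token in the line.
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, one);
        }
    }
}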
Predefined Mappers
Reducer
Reducer
1. Extends Reducer<K2,V2,K3,V3>
2. Overrides method:
void reduce(K2 key, Iterable<V2> values, Context context)
3. Uses context.write(K3, V3) to emit key/value pairs.
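WordCount Reducer
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
// Nested inside the job's driver class.
public static class Reduce
        extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Sum all counts received for this word.
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        context.write(key, new IntWritable(sum));
    }
}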
Predefined Reducers
Partitioner
Partitioner
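The partitioner decides which key goes to which reducer.
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;
class WordSizePartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text word, IntWritable count, int numOfPartitions) {
        // Placeholder from the slide: every key is sent to partition 0.
        return 0;
    }
}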
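As an illustration of what a non-trivial partitioning rule could look like, here is a plain-Java sketch that uses no Hadoop classes. The class PartitionSketch and its methods are hypothetical. hashPartition masks the sign bit of the hash code and takes it modulo the partition count, which is the usual hash-partitioning idea; wordSizePartition routes by word length, which is only a guess at what the WordSizePartitioner stub is meant to do.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java sketch of partitioning rules; no Hadoop classes are used.
class PartitionSketch {

    // Hash partitioning: mask the sign bit so the result is never negative,
    // then take it modulo the number of partitions.
    static int hashPartition(String key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    // Assumed rule for a word-size partitioner: route by word length.
    static int wordSizePartition(String word, int numPartitions) {
        return word.length() % numPartitions;
    }

    // Distribute keys into numPartitions buckets using the word-size rule.
    static List<List<String>> assign(List<String> keys, int numPartitions) {
        List<List<String>> partitions = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            partitions.add(new ArrayList<>());
        }
        for (String key : keys) {
            partitions.get(wordSizePartition(key, numPartitions)).add(key);
        }
        return partitions;
    }

    public static void main(String[] args) {
        List<String> words =
                List.of("this", "is", "the", "foo", "file", "and", "bar", "one");
        // prints [[the, foo, and, bar, one], [this, file], [is]]
        System.out.println(assign(words, 3));
    }
}
```

Whatever the rule, it must be deterministic: every occurrence of a key has to land in the same partition so that a single reducer sees all of its values. The skewed result above also shows why the choice of rule matters for load balancing.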
Combiner
Combiner
It’s a local Reduce task that runs on the Mapper’s output.
WordCount Mapper Output:
1. Without Combiner: <the, 1>, <file, 1>, <the, 1>, …
2. With Combiner: <the, 2>, <file, 2>, ...
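The effect of a combiner can be sketched in plain Java, without Hadoop. The class CombinerSketch and its combine method are hypothetical; they pre-aggregate the pairs emitted by a single mapper before anything would be sent over the network.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java sketch of a combiner: a local reduce over one mapper's output.
class CombinerSketch {

    // Sum the counts per key, keeping keys in first-seen order.
    static Map<String, Integer> combine(List<Map.Entry<String, Integer>> mapperOutput) {
        Map<String, Integer> combined = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> pair : mapperOutput) {
            combined.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return combined;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> mapperOutput = List.of(
                Map.entry("the", 1), Map.entry("file", 1), Map.entry("the", 1));
        // Three pairs go in, two come out: prints {the=2, file=1}
        System.out.println(combine(mapperOutput));
    }
}
```

This works for WordCount because addition is associative and commutative, so summing partial sums gives the same total as summing the individual ones. An operation without that property (an average, for example) cannot be reused as a combiner unchanged.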
Reading and Writing
Reading and Writing
1. Input data usually resides in large files.
2. MapReduce’s processing power comes from splitting the
input data into chunks (InputSplits) that can be processed in parallel.
3. Hadoop’s FileSystem provides the class
FSDataInputStream for file reading. It extends
DataInputStream with random read access.
InputFormat Classes
● TextInputFormat
o <offset, line>
● KeyValueTextInputFormat
o key\tvalue => <key, value> (key and value separated by a tab)
● NLineInputFormat
o <offset, line>, with N lines per input split
You can define your own InputFormat class ...
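The record shapes listed above can be imitated in plain Java. This is only a sketch of how the records look, not of how Hadoop reads files: the class InputFormatSketch and both methods are hypothetical, the input is assumed to be UTF-8 text with "\n" line endings, and the whole input is held in one string rather than split across machines.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Plain-Java sketch of the records produced by two text input formats.
class InputFormatSketch {

    // TextInputFormat-style records: key = byte offset of the line start,
    // value = the line itself.
    static List<Map.Entry<Long, String>> offsetLineRecords(String text) {
        List<Map.Entry<Long, String>> records = new ArrayList<>();
        long offset = 0;
        for (String line : text.split("\n")) {
            records.add(Map.entry(offset, line));
            // Advance past the line's bytes plus one byte for the newline.
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return records;
    }

    // KeyValueTextInputFormat-style record: split at the first tab.
    // Without a tab, the whole line becomes the key and the value is empty.
    static Map.Entry<String, String> keyValueRecord(String line) {
        int tab = line.indexOf('\t');
        if (tab < 0) {
            return Map.entry(line, "");
        }
        return Map.entry(line.substring(0, tab), line.substring(tab + 1));
    }

    public static void main(String[] args) {
        String text = "This is the foo file\nAnd this is the bar one";
        for (Map.Entry<Long, String> record : offsetLineRecords(text)) {
            System.out.println("<" + record.getKey() + ", " + record.getValue() + ">");
        }
        Map.Entry<String, String> kv = keyValueRecord("user\tchuck");
        System.out.println("<" + kv.getKey() + ", " + kv.getValue() + ">");
    }
}
```

The offset key explains why the WordCount mapper takes a LongWritable key that it never uses: with TextInputFormat the key is just the position of the line in the file.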
OutputFormat
1. The output has no splits.
2. Each reducer generates an output file named
part-nnnnn, where nnnnn is the partition ID
of the reducer.
Predefined OutputFormat classes:
● TextOutputFormat
o <k, v> => k\tv
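The naming and line layout described for the output can be sketched in a few lines of plain Java. The class OutputFormatSketch and its methods are hypothetical and write nothing to disk; they only show the part-nnnnn pattern as stated on the slide (the exact file name may differ between Hadoop API versions) and the tab-separated text line.

```java
// Plain-Java sketch of reducer output naming and text line layout.
class OutputFormatSketch {

    // File name for a reducer's output: "part-" plus the zero-padded partition ID.
    static String partFileName(int partitionId) {
        return String.format("part-%05d", partitionId);
    }

    // TextOutputFormat-style line: key and value separated by a tab.
    static String textLine(String key, int value) {
        return key + "\t" + value;
    }

    public static void main(String[] args) {
        System.out.println(partFileName(0));     // prints part-00000
        System.out.println(textLine("this", 2)); // prints this<TAB>2
    }
}
```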
Recap
END OF SESSION #1
Q

Introduction to MapReduce and Hadoop

Editor's notes

  1. https://sites.google.com/site/hadoopintroduction/home/what-is-hadoop
  2. Lucene is a full-featured text indexing and search library. Nutch set out to build a complete web search engine on top of Lucene, with a web crawler, an HTML parser, and so on. Problem: there are billions of web pages out there! What could poor Nutch do? Google published its papers on GFS (2003) and MapReduce (2004) and said it was using these techniques in its search engine (really?). Doug Cutting and his team applied these techniques to Nutch, and Hadoop was born.
  3. Challenges in processing Large Data in a distributed way.
  4. Accessible: Hadoop runs on large clusters of commodity machines or on cloud computing services such as Amazon’s Elastic Compute Cloud (EC2).
     Robust: Because it is intended to run on commodity hardware, Hadoop is architected with the assumption of frequent hardware malfunctions. It can gracefully handle most such failures.
     Scalable: Hadoop scales linearly to handle larger data by adding more nodes to the cluster.
     Simple: Hadoop allows users to quickly write efficient parallel code.
     Source: Hadoop in Action, section 1.2.
  5. REF: https://sites.google.com/site/hadoopintroduction/home/comparing-sql-databases-and-hadoop
  6. REF: https://developer.yahoo.com/hadoop/tutorial/module4.html
  7. Table from “Hadoop In Action”. Images source: https://developer.yahoo.com/hadoop/tutorial/module4.html
  8. Pseudo-code for map and reduce functions for word counting. Source: Hadoop In Action.
  9. We now have a general overview of MapReduce; let’s see how Hadoop works.
  10. Hadoop In Action Figure 2.1
  11. Local (standalone) mode: No HDFS. No Hadoop daemons. Used for debugging and testing the logic of a MapReduce program.
      Pseudo-distributed mode: All daemons run on a single machine. Used for debugging your code, allowing you to examine memory usage, HDFS input/output issues, and other daemon interactions.
      Fully distributed mode.
  12. http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/
  13. This slide is initially left blank.
  14. https://developer.yahoo.com/hadoop/tutorial/module4.html
  15. This slide is initially left blank.
  16. When the reducer task receives the output from the various mappers, it sorts the incoming data on the key of the (key/value) pair and groups together all values of the same key.
  17. When the reducer task receives the output from the various mappers, it sorts the incoming data on the key of the (key/value) pair and groups together all values of the same key.