MapReduce and the New
Software Stack
Maruf Aytekin
PhD Student
BAU Computer Engineering Department
Besiktas/Istanbul
January 5, 2015
Outline
• Introduction
• DFS
• MapReduce
• Examples
• Matrix Calculation on Hadoop
Introduction
Modern data-mining and ML applications,
often called "big-data analysis", require us to
manage massive amounts of data quickly.
Important Examples
• The ranking of Web pages by importance,
which involves an iterated matrix-vector
multiplication where the dimension is
many billions.
• Searches in social-networking sites, which
involve graphs with hundreds of millions
of nodes and many billions of edges.
• Processing large amounts of text or
streams, as in news recommendation.
New software stack
• Not a “supercomputer” (Beowulf etc.)
• “computing clusters” – large collections of
commodity hardware, including conventional
processors (“compute nodes”) connected by
Ethernet cables and inexpensive switches.
Distributed File System
• A new form of file system that uses
much larger units than the disk blocks in a
conventional operating system.
• Files can be enormous, possibly a terabyte
in size.
• Files are rarely updated.
Physical Organization
• Files are divided into chunks
• Chunks are replicated
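The two points above can be illustrated with a minimal sketch. It assumes a tiny chunk size, a replication factor of 3, and a simple round-robin placement. The node names and both helper functions are hypothetical, and real systems such as HDFS use more elaborate, rack-aware placement.

```python
def split_into_chunks(data: bytes, chunk_size: int) -> list[bytes]:
    """Cut a file's bytes into fixed-size chunks (the last one may be shorter)."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def place_replicas(num_chunks: int, nodes: list[str], replication: int) -> dict[int, list[str]]:
    """Assign each chunk to `replication` distinct nodes, round-robin (illustrative policy only)."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        chunk_id: [nodes[(chunk_id + r) % len(nodes)] for r in range(replication)]
        for chunk_id in range(num_chunks)
    }


# Hypothetical 10-byte "file" and 4-node cluster.
chunks = split_into_chunks(b"abcdefghij", chunk_size=4)
placement = place_replicas(len(chunks), ["node-a", "node-b", "node-c", "node-d"], replication=3)

for chunk_id, holders in placement.items():
    print(chunk_id, chunks[chunk_id], holders)
```

Because every chunk lives on several nodes, the loss of a single node leaves at least one copy of each chunk available.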
DFS Implementations
• The Google File System (GFS)
• Hadoop Distributed File System (HDFS)
• CloudStore, by Kosmix
HDFS Architecture
Block Replication
MapReduce
Style of computing/framework/pattern.
Implementations:
• MapReduce by Google (internal)
• Hadoop by the Apache Software Foundation.
MapReduce
Operates exclusively on <key, value> pairs.
(input) <k1, v1>
-> map -> <k2, v2>
-> combine -> <k2, v2>
-> reduce -> <k3, v3> (output)
MapReduce Computation
MapReduce
In brief, a MapReduce computation executes as follows:
• Chunks from a DFS are given to Map tasks.
• These Map tasks turn the chunks into a sequence of
<key, value> pairs.
• The <key, value> pairs from each Map task are
collected by a master controller and sorted by key
(the grouping step, often called shuffle and sort).
• The keys are divided among all the Reduce tasks, so
all <key,value> pairs with the same key wind up at
the same Reduce task.
• Each Reduce task works on one key at a time, processes
the values for that key, and outputs the results
as <key, value> pairs.
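The steps above can be imitated in a single process. The following is a minimal sketch, not how Hadoop is implemented: `run_mapreduce`, `map_sales`, `reduce_total`, and the sample sales records are all hypothetical names and data chosen for illustration, and there is no distribution or fault tolerance.

```python
from collections import defaultdict
from typing import Any, Callable, Iterable


def run_mapreduce(
    chunks: Iterable[Any],
    map_fn: Callable[[Any], Iterable[tuple[Any, Any]]],
    reduce_fn: Callable[[Any, list[Any]], Iterable[tuple[Any, Any]]],
) -> list[tuple[Any, Any]]:
    # Each chunk goes to one "Map task", which emits <key, value> pairs.
    intermediate = []
    for chunk in chunks:
        intermediate.extend(map_fn(chunk))

    # Collect the pairs and group them by key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # All values for a key reach the same reduce call; keys are visited in sorted order.
    output = []
    for key in sorted(groups):
        output.extend(reduce_fn(key, groups[key]))
    return output


def map_sales(chunk):
    # Hypothetical record format: "store,amount".
    for line in chunk:
        store, amount = line.split(",")
        yield store, float(amount)


def reduce_total(store, amounts):
    yield store, sum(amounts)


chunks = [["north,10.0", "south,5.0"], ["north,2.5"]]
result = run_mapreduce(chunks, map_sales, reduce_total)
print(result)  # [('north', 12.5), ('south', 5.0)]
```

In a real cluster the grouping step also partitions the keys among several Reduce tasks; here a single loop plays that role.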
Execution of MapReduce
Hello World
Word Count
• file01:
Hello World Bye World
• file02:
Hello Hadoop Goodbye Hadoop
Word Count
For the given sample input
the first map emits:
< Hello, 1 >
< World, 1 >
< Bye, 1 >
< World, 1 >
The second map emits:
< Hello, 1 >
< Hadoop, 1 >
< Goodbye, 1 >
< Hadoop, 1 >
Combiner:
After local aggregation and sorting on the keys,
the output of the first map is:
< Bye, 1 >
< Hello, 1 >
< World, 2 >
The output of the second map:
< Goodbye, 1 >
< Hadoop, 2 >
< Hello, 1 >
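The map and combiner outputs listed above can be reproduced with a short sketch. It assumes whitespace tokenization and a combiner that sums counts locally per map task; `map_words` and `combine` are illustrative names, not Hadoop API calls.

```python
from collections import Counter


def map_words(text):
    """Emit <word, 1> for every whitespace-separated token."""
    return [(word, 1) for word in text.split()]


def combine(pairs):
    """Sum the counts per word within one map task's output, sorted by key."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return sorted(totals.items())


first_map = map_words("Hello World Bye World")
second_map = map_words("Hello Hadoop Goodbye Hadoop")

print(combine(first_map))   # [('Bye', 1), ('Hello', 1), ('World', 2)]
print(combine(second_map))  # [('Goodbye', 1), ('Hadoop', 2), ('Hello', 1)]
```

The combiner only reduces the amount of data sent on to the reducers; the final totals are the same with or without it.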
Word Count
Thus the output of the job is:
< Bye, 1 >
< Goodbye, 1 >
< Hadoop, 2 >
< Hello, 2 >
< World, 2 >
The Reducer implementation, via the reduce method, simply sums up
the values, which are the occurrence counts for each key.
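The reduce step can be sketched as follows, starting from the two combiner outputs given earlier. The helper names `shuffle` and `reduce_counts` are hypothetical; the grouping is done in memory rather than across machines.

```python
from collections import defaultdict

# Combiner outputs of the two map tasks, as listed in the slides.
combined_outputs = [
    [("Bye", 1), ("Hello", 1), ("World", 2)],
    [("Goodbye", 1), ("Hadoop", 2), ("Hello", 1)],
]


def shuffle(outputs):
    """Group all values by key across the map outputs."""
    grouped = defaultdict(list)
    for pairs in outputs:
        for word, count in pairs:
            grouped[word].append(count)
    return grouped


def reduce_counts(word, counts):
    """Sum the occurrence counts for one key."""
    return word, sum(counts)


job_output = [reduce_counts(word, counts) for word, counts in sorted(shuffle(combined_outputs).items())]
print(job_output)
# [('Bye', 1), ('Goodbye', 1), ('Hadoop', 2), ('Hello', 2), ('World', 2)]
```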
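Matrix Calculation
P = M N
(Figure: M is indexed by rows i and columns j, N by rows j and columns k, and P by rows i and columns k; entries P(1,1) and P(1,2) are labeled.)
Matrix Data Model for MapReduce:
M (i, j, m_ij)
N (j, k, n_jk)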
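For reference, the product P = M N is defined entry by entry by the standard formula below, written with the same indices as the data model. Entries missing from the triple lists contribute nothing to the sum if they are treated as zero, which is an assumption about how the sparse representation is meant to be read.

```latex
p_{ik} = \sum_{j} m_{ij}\, n_{jk}
```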
Matrix Data Files for MapReduce
M,0,0,10.0
M,0,2,9.0
M,0,3,9.0
M,1,0,1.0
M,1,1,3.0
M,1,2,18.0
M,1,3,25.2
...
M, i, j, m_ij
N,0,0,1.0
N,0,2,3.0
N,0,4,2.0
N,1,0,2.0
N,3,2,-1.0
N,3,6,4.0
N,4,6,5.0
...
N, j, k, n_jk
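Lines in this format can be turned into typed records with a few lines of Python. This is a minimal sketch assuming comma-separated fields in the order matrix name, row index, column index, value; `Entry` and `parse_entry` are hypothetical names.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    matrix: str   # "M" or "N"
    row: int      # i for M, j for N
    col: int      # j for M, k for N
    value: float


def parse_entry(line: str) -> Entry:
    """Parse one line such as 'M,0,0,10.0' into an Entry."""
    name, row, col, value = (field.strip() for field in line.split(","))
    if name not in ("M", "N"):
        raise ValueError(f"unknown matrix name: {name!r}")
    return Entry(name, int(row), int(col), float(value))


sample_lines = ["M,0,0,10.0", "M,1,3,25.2", "N,3,2,-1.0"]
entries = [parse_entry(line) for line in sample_lines]

for entry in entries:
    print(entry)
```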
Map
Reduce
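The Map and Reduce steps for P = M N can be sketched as follows. This assumes the common one-pass formulation, which may differ from the definitions on the original slides: every entry m_ij is sent to all cells (i, k) of P, every entry n_jk is sent to all cells (i, k) of P, and each reduce call multiplies the entries that share the same j and sums the products. The function names and the 2 x 2 example matrices are hypothetical.

```python
from collections import defaultdict


def map_entry(entry, num_rows_m, num_cols_n):
    """Emit <(i, k), (matrix, j, value)> for every cell of P that needs this entry."""
    name, a, b, value = entry
    if name == "M":                      # entry is m_ij
        i, j = a, b
        for k in range(num_cols_n):
            yield (i, k), ("M", j, value)
    else:                                # entry is n_jk
        j, k = a, b
        for i in range(num_rows_m):
            yield (i, k), ("N", j, value)


def reduce_cell(key, values):
    """Compute p_ik as the sum over j of m_ij * n_jk for one cell (i, k)."""
    m_row = {j: v for name, j, v in values if name == "M"}
    n_col = {j: v for name, j, v in values if name == "N"}
    total = sum(m_row[j] * n_col[j] for j in m_row.keys() & n_col.keys())
    return key, total


def multiply(entries, num_rows_m, num_cols_n):
    """Run map, group by key, and reduce in memory."""
    grouped = defaultdict(list)
    for entry in entries:
        for key, value in map_entry(entry, num_rows_m, num_cols_n):
            grouped[key].append(value)
    return dict(reduce_cell(key, values) for key, values in sorted(grouped.items()))


# M = [[1, 2], [3, 4]] and N = [[5, 6], [7, 8]] as (matrix, row, col, value) tuples.
entries = [
    ("M", 0, 0, 1.0), ("M", 0, 1, 2.0), ("M", 1, 0, 3.0), ("M", 1, 1, 4.0),
    ("N", 0, 0, 5.0), ("N", 0, 1, 6.0), ("N", 1, 0, 7.0), ("N", 1, 1, 8.0),
]
product = multiply(entries, num_rows_m=2, num_cols_n=2)
print(product)  # {(0, 0): 19.0, (0, 1): 22.0, (1, 0): 43.0, (1, 1): 50.0}
```

The one-pass form replicates each input entry many times, once per cell of P that uses it. That keeps the job to a single MapReduce round at the cost of more intermediate data.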
Example
Map Task
Matrix M
key, value pairs produced as
follows:
Matrix N
key, value pairs produced as
follows:
Map Task Output
Reduce Task
P =
Application
• Run the application on Hadoop
Thank you!
Q & A
