MapReduce : Simplified Data Processing on Large Clusters
1. by Jeffrey Dean and Sanjay Ghemawat
Communications of the ACM, 2008
Presented by: Abolfazl Asudeh
2. MapReduce
Patented by Google
A parallel programming model and an associated implementation for processing and generating large datasets
Users specify the computation in terms of a map and a reduce function
The system automatically parallelizes the computation across large-scale clusters
3.
Previously, users had to handle the parallelization of their programs over hundreds or thousands of machines themselves:
Distribute the data
Handle failures
Schedule inter-machine communication to make efficient use of resources
From experience, the authors found that most of their computations involved applying a map operation to produce intermediate key/value pairs, and then applying a reduce operation to combine/aggregate those pairs.
4. Programming Model
Input: a set of input key/value pairs
Output: a set of output key/value pairs
Map (written by the user): takes an input pair and produces a set of intermediate key/value pairs.
MapReduce library: groups all intermediate values associated with the same intermediate key and passes them to the reduce function.
Reduce (written by the user): accepts an intermediate key and a set of values for that key, and merges these values together to form a possibly smaller set of values.
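To make the split between user code and library concrete, here is a minimal sketch of the two user-written functions for the word-counting example on the next slide. It is written in Python for brevity (the paper's own pseudocode is C++-like), and the function names are illustrative rather than part of any real MapReduce API.

    def word_count_map(key, value):
        # key: document name (unused here), value: document contents
        for word in value.split():
            yield (word.lower(), 1)      # emit one intermediate <word, 1> pair

    def word_count_reduce(key, values):
        # key: a word, values: every count emitted for that word
        yield (key, sum(values))         # merge the counts into a single total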
7. Basic Example: Word Counting
Input (one line per map task; the slide draws this as a diagram with four map tasks M and two reduce tasks R):
How now
Brown cow
How does
It work now
Map output (intermediate key/value pairs):
<How,1> <now,1> <brown,1> <cow,1>
<How,1> <does,1> <it,1> <work,1> <now,1>
Grouped by intermediate key (input to the reduce tasks):
<How,1 1> <now,1 1> <brown,1> <cow,1> <does,1> <it,1> <work,1>
Reduce output:
brown 1
cow 1
does 1
How 2
it 1
now 2
work 1
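The whole flow above can be simulated in a few lines on a single machine. The sketch below repeats the word_count_map / word_count_reduce functions from the Programming Model slide so it runs on its own; the grouping dictionary stands in for the shuffle that the real library performs across machines, and word case is folded to lowercase for simplicity.

    from collections import defaultdict

    def word_count_map(key, value):
        for word in value.split():
            yield (word.lower(), 1)

    def word_count_reduce(key, values):
        yield (key, sum(values))

    lines = ["How now", "Brown cow", "How does", "It work now"]

    # Map phase: one map task per input line.
    grouped = defaultdict(list)
    for i, line in enumerate(lines):
        for k, v in word_count_map(str(i), line):
            grouped[k].append(v)         # shuffle: group values by intermediate key

    # Reduce phase: one reduce call per intermediate key.
    for key in sorted(grouped):
        for word, total in word_count_reduce(key, grouped[key]):
            print(word, total)           # brown 1, cow 1, does 1, how 2, it 1, now 2, work 1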
8. Execution Overview
1. The MapReduce library splits the input files into M pieces, typically 16-64 MB each, and starts up many copies of the program on a cluster of machines.
2. One copy of the program acts as the master; the rest are workers. The master assigns the M map tasks and R reduce tasks to the workers, picking idle workers and handing them tasks.
3. A worker assigned a map task reads the corresponding input split, parses the key/value pairs, and passes each pair to the user-defined map function.
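A rough sketch of the splitting in step 1, assuming plain files on a local file system; the 64 MB split size, the function name, and the (path, offset, length) representation are illustrative choices, and the real library also accounts for record and block boundaries.

    import os

    SPLIT_SIZE = 64 * 1024 * 1024        # upper end of the typical 16-64 MB range

    def make_splits(paths, split_size=SPLIT_SIZE):
        # Return one (path, offset, length) triple per map task; M = len(result).
        splits = []
        for path in paths:
            size = os.path.getsize(path)
            offset = 0
            while offset < size:
                length = min(split_size, size - offset)
                splits.append((path, offset, length))
                offset += length
        return splits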
9. Execution Overview
4. Periodically, the buffered pairs are written to local disk, partitioned into R regions. The locations of the buffered pairs on the local disk are passed back to the master, which is responsible for forwarding these locations to the reduce workers.
5. A reduce worker remotely reads the buffered data from the local disks of the corresponding map workers, sorts it by intermediate key, and groups all values for the same key together.
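Two mechanisms do most of the work in steps 4 and 5: a partitioning function that assigns each intermediate key to one of the R regions (the paper's default is a hash of the key modulo R), and the reduce-side sort and group. The sketch below shows both in memory; R, the sample pairs, and the list-based "regions" are illustrative.

    from itertools import groupby

    R = 2
    pairs = [("How", 1), ("now", 1), ("How", 1), ("now", 1), ("cow", 1)]

    def partition(key, r=R):
        # Which of the R regions (i.e. which reduce task) gets this key.
        # Python's hash is per-process; the real library uses a stable hash.
        return hash(key) % r

    # Map side: drop each pair into its region.
    regions = [[] for _ in range(R)]
    for k, v in pairs:
        regions[partition(k)].append((k, v))

    # Reduce side: sort one region by key, then group the values per key.
    for region in regions:
        for key, group in groupby(sorted(region), key=lambda kv: kv[0]):
            print(key, [v for _, v in group])   # e.g. How [1, 1]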
10. Execution Overview
6. The reduce worker passes each intermediate key and its set of values to the reduce function.
7. When all map and reduce tasks are done, the MapReduce call returns to the user program.
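One way to picture step 6: each reduce task writes its reduce() output to its own file, so a run ends with R output files. The file-name pattern and helper below are illustrative, not the library's actual interface.

    def run_reduce_task(r, grouped_pairs, reduce_fn):
        # grouped_pairs: (key, [values...]) tuples for region r, already
        # sorted and grouped as in step 5.
        with open(f"output-{r:05d}.txt", "w") as out:
            for key, values in grouped_pairs:
                for k, v in reduce_fn(key, values):
                    out.write(f"{k}\t{v}\n")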
12. Fault Tolerance
Failure detection: the master pings every worker periodically; a worker that does not respond is considered failed.
If a map worker fails:
All of its map tasks, including completed ones, are rescheduled on other workers, because map output is written to the failed machine's local disk.
The master sends the new map worker's output location to all corresponding reduce workers; any reducer that has not yet read the data from the failed mapper reads it from the new one instead.
If a reduce worker fails:
Its in-progress reduce tasks are rescheduled, but completed reduce tasks do not need to be re-executed, since their output is stored in a global file system.
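A small sketch of the master's bookkeeping when a ping goes unanswered, following the rules above; the task dictionaries and field names are assumptions made for illustration.

    def handle_worker_failure(failed, map_tasks, reduce_tasks):
        for task in map_tasks:
            # Map output sits on the failed machine's local disk, so even
            # completed map tasks go back to the idle pool for rescheduling.
            if task["worker"] == failed and task["state"] in ("in_progress", "completed"):
                task["state"], task["worker"] = "idle", None
        for task in reduce_tasks:
            # Completed reduce output is already in the global file system;
            # only in-progress reduce tasks need to be redone.
            if task["worker"] == failed and task["state"] == "in_progress":
                task["state"], task["worker"] = "idle", None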
13. Execution Optimization
Locality
Network bandwidth is a relatively scarce resource.
Compute on local copies of the input, which the underlying distributed file system (GFS; HDFS in Hadoop) already spreads across the cluster machines (see the sketch after this slide).
Task Granularity
Ideally, M and R should be much larger than the number of worker machines.
Having each worker perform many different tasks improves dynamic load balancing and also speeds up recovery when a worker fails; the paper reports typical runs with M = 200,000 and R = 5,000 on about 2,000 worker machines.
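The locality point above boils down to a scheduling preference, sketched here: give a map task to an idle worker that already stores a replica of its input split, and fall back to any idle worker otherwise. The data shapes are assumptions for illustration.

    def pick_worker_for_split(replica_hosts, idle_workers):
        # replica_hosts: machines holding a copy of this input split
        # idle_workers: list of machines currently without a task
        local = [w for w in idle_workers if w in replica_hosts]
        if local:
            return local[0]                  # data-local assignment, no network read
        return idle_workers[0] if idle_workers else None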