SlideShare ist ein Scribd-Unternehmen logo
1 von 21
MapReduce:
Simplified Data
Processing on
Large Clusters
Papers We Love
Bucharest Chapter
October 12, 2015
I am Adrian Florea
Architect Lead @ IBM Bucharest Software Lab
Hello!
What is?
◉MapReduce is a programming model and implementation for processing large data sets
◉Users specify a Map function that processes a key-value pair to generate a set of
intermediate key-value pairs and a Reduce function that aggregates all intermediate
values that share the same intermediate key in order to combine the derived data
appropriately
map:(k1, v1)->[(k2, v2)]
reduce:(k2, [v2])->[(k3, v3)]
Map
◉function (mathematics)
◉map (Java)
◉Select (.NET)
Where else have we seen this?
Reduce
◉fold (functional programming)
◉reduce (Java)
◉Aggregate (.NET)
Other analogies
"To draw an analogy to SQL, map is like
the group-by clause of an aggregate
query. Reduce is analogous to the
aggregate function that is computed
over all the rows with the same group-
by attribute"
D.J. DeWitt & M. Stonebraker
Divide-and-conquer algorithms
“recursively breaking down a problem into two
or more sub-problems of the same (or related)
type (divide), until these become simple enough
to be solved directly (conquer). The solutions to
the sub-problems are then combined to give a
solution to the original problem.”
Jeffrey Dean
Google Fellow
December 6, 2004 4PM
Sanjay Ghemawat
Google Fellow
History
◉April 1960: John McCarthy introduced the concept of “maplist”
◉September 4, 1998: Google founded
◉1998-2003: hundreds of special-purpose large data computation
programs in Google
◉February 2003: 1st version of MapReduce
◉August 2003: MapReduce significant enhancements
◉June 18, 2004: Patent US7650331 B1 filed
◉December 6, 2004: 1st MapReduce public presentation
◉2005: Hadoop implementation started in Java (Douglass R. Cutting &
Michael J. Cafarella)
◉September 4, 2007: Hadoop 0.14.1
◉January 19, 2010: Patent US7650331 B1 published
◉July 6, 2015: Hadoop 2.7.1
Distribution issues
◉Communication and routing
which nodes should be involved?
what transport protocol should be used?
threads/events/connections management
remote execution of your processing code?
◉Fault tolerance and fault detection
◉Load balancing / partitioning of data
heterogeneity of nodes
skew in data
network topology
◉Parallelization strategy
algorithmic issues of work splitting
“without having to deal with
failures, the rest of the support
code just isn’t so complicate”
S. Ghemawat
MapReduce model
Map Operation Reduce Operation
Input Data Intermediate Data Output Data
application-independent
Map Module
application-independent
Reduce Module
MapReduce system
application-specific
Map Operation
application-specific
Reduce OperationInput
Data
Intermediate
Data
Output
Data
Original article Hadoop wiki
Original article Hadoop wiki
MapReduce model in practice
map(String key, String value):
for each Word w in value:
EmitIntermediate(w, "1");
reduce(String key, Iterator values)
int result = 0;
for each v in values:
result += ParseInt(v);
Emit(key, AsString(result));
map:(k1, v1)->[(k2, v2)]
reduce:(k2, [v2])->[(k3, v3)]
How long does it take to go
through 1 TB?
Sequentially: 3 hours
MapReduce: startup overhead 70 seconds + computation 80 seconds
Environment
◉ 1800 Linux dual-processorx86 machines, 2-4 GB memory
◉ Fast Ethernet/Giga Ethernet
◉ Inexpensive IDE disks and a distributed Google File System
Take a walk through a Google data center
Tianhe-2, #1 supercomputer: 3,120,000 cores, 1,5 PB total memory
Execution diagram
◉Master process is a task itself initiated by the WQM and is responsible for
assigning all other tasks to worker processes
◉Each worker invokes at least a map thread and a reduce thread
◉If a worker fails, its tasks are reassigned to another worker process
◉When WQM receives a job, it allocates the job to the master that calculates
and requires M+R+1 processes to be allocated to the job
◉WQM responds with the process allocation info (can result less processes) to
the master that will manage the performance of the job
◉Reduce tasks begin work when the master informs them that there are
intermediate files ready
◉Input data (files/DB/memory) are splitted in data blocks (16-64 MB)
automatically or configurable
◉The worker to which a map task has been assigned applies the map() operator
to the respective input data block
◉When the worker completes the task, it informs the master of the status
◉Master informs workers where to find intermediate data and schedules their
reading
◉Workers (3 & 4) sort the intermediate key-value pairs, then merge (by applying
reduce()) them and write to output
Workflow diagram
◉When a process completes a task it informs WQM
which updates the status tables
◉When WQM discovers one process failed, it assign its
tasks to a new process and updates the status tables
Task Status Table
◉TaskID
◉Status (InProgress, Waiting, Completed, Failed)
◉ProcessID
◉InputFiles (Input, Intermediate)
◉OutputFiles
Process Status Table
◉ProcessID
◉Status (Idle, Busy, Failed)
◉Location (CPU ID, etc.)
◉Current (TaskID, WQM)
Questions from the audience
@ original paper presentation
◉Q: Wanted to know of any task that could not be handled using MapReduce?
A: join operations could not be performed with the current model
◉Q: Wondered how MapReduce differs from parallel databases?
A: MapReduce is stored across a large number of machines as compared to parallel databases,
the abstractions are fairly simple to use in MapReduce, and
MapReduce also benefits greatly from locality optimizations
Bibliography
◉ J. Dean, S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters", OSDI'04,
Dec. 6, 2004
◉ J. Dean, S. Ghemawat, “System and method for efficient large-scale data processing”, Patent
US 7650331 B1, Jan. 19, 2010
◉ S. Ghemawat, J. Dean, J. Zhao, M. Austern, A. Spector, "Google Technology RoundTable: Map
Reduce", Aug. 21, 2008 – Youtube
◉ P. Mahadevan, "OSDI'04 Conference Reports", ;LOGIN: Vol. 30, No. 2, Apr. 2005, p. 61
◉ R. Jacotin, “Lecture: The Google MapReduce”, SlideShare, October 3, 2014
Any questions ?
You can find me at
Thanks!

Weitere ähnliche Inhalte

Was ist angesagt?

06 how to write a map reduce version of k-means clustering
06 how to write a map reduce version of k-means clustering06 how to write a map reduce version of k-means clustering
06 how to write a map reduce version of k-means clustering
Subhas Kumar Ghosh
 
Hadoop secondary sort and a custom comparator
Hadoop secondary sort and a custom comparatorHadoop secondary sort and a custom comparator
Hadoop secondary sort and a custom comparator
Subhas Kumar Ghosh
 

Was ist angesagt? (20)

Map Reduce
Map ReduceMap Reduce
Map Reduce
 
Map reduce
Map reduceMap reduce
Map reduce
 
Hadoop map reduce in operation
Hadoop map reduce in operationHadoop map reduce in operation
Hadoop map reduce in operation
 
Hadoop map reduce v2
Hadoop map reduce v2Hadoop map reduce v2
Hadoop map reduce v2
 
06 how to write a map reduce version of k-means clustering
06 how to write a map reduce version of k-means clustering06 how to write a map reduce version of k-means clustering
06 how to write a map reduce version of k-means clustering
 
Mapreduce script
Mapreduce scriptMapreduce script
Mapreduce script
 
MapReduce:Simplified Data Processing on Large Cluster Presented by Areej Qas...
MapReduce:Simplified Data Processing on Large Cluster  Presented by Areej Qas...MapReduce:Simplified Data Processing on Large Cluster  Presented by Areej Qas...
MapReduce:Simplified Data Processing on Large Cluster Presented by Areej Qas...
 
Introduction to MapReduce
Introduction to MapReduceIntroduction to MapReduce
Introduction to MapReduce
 
Map reduce in Hadoop
Map reduce in HadoopMap reduce in Hadoop
Map reduce in Hadoop
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
Hadoop secondary sort and a custom comparator
Hadoop secondary sort and a custom comparatorHadoop secondary sort and a custom comparator
Hadoop secondary sort and a custom comparator
 
MapReduce Scheduling Algorithms
MapReduce Scheduling AlgorithmsMapReduce Scheduling Algorithms
MapReduce Scheduling Algorithms
 
E031201032036
E031201032036E031201032036
E031201032036
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
MapReduce
MapReduceMapReduce
MapReduce
 
Map reduce presentation
Map reduce presentationMap reduce presentation
Map reduce presentation
 
Map Reduce Online
Map Reduce OnlineMap Reduce Online
Map Reduce Online
 
Introduction to MapReduce
Introduction to MapReduceIntroduction to MapReduce
Introduction to MapReduce
 
MapReduce Algorithm Design
MapReduce Algorithm DesignMapReduce Algorithm Design
MapReduce Algorithm Design
 
Hadoop
HadoopHadoop
Hadoop
 

Andere mochten auch

Application form FOR LINK'D IN
Application form FOR LINK'D INApplication form FOR LINK'D IN
Application form FOR LINK'D IN
Lisa Haines
 
Bachelors of Commerce Degree Final Certificate
Bachelors of Commerce Degree Final CertificateBachelors of Commerce Degree Final Certificate
Bachelors of Commerce Degree Final Certificate
Shadrack Mwenda
 

Andere mochten auch (18)

Mapas mentales
Mapas mentalesMapas mentales
Mapas mentales
 
Modern History
Modern HistoryModern History
Modern History
 
Infortask | O sistema mais simples e prático para organizar atividades!
Infortask | O sistema mais simples e prático para organizar atividades!Infortask | O sistema mais simples e prático para organizar atividades!
Infortask | O sistema mais simples e prático para organizar atividades!
 
Cv Pradipta
Cv PradiptaCv Pradipta
Cv Pradipta
 
Application form FOR LINK'D IN
Application form FOR LINK'D INApplication form FOR LINK'D IN
Application form FOR LINK'D IN
 
Veneers/ fixed orthodontics courses
Veneers/ fixed orthodontics coursesVeneers/ fixed orthodontics courses
Veneers/ fixed orthodontics courses
 
10 reasons for not booking your flight on go to gate
10 reasons for not booking your flight on go to gate10 reasons for not booking your flight on go to gate
10 reasons for not booking your flight on go to gate
 
BWS - Profile
BWS - ProfileBWS - Profile
BWS - Profile
 
Subject Selection - Industrial Arts
Subject Selection - Industrial ArtsSubject Selection - Industrial Arts
Subject Selection - Industrial Arts
 
Seguridad informatica
Seguridad informaticaSeguridad informatica
Seguridad informatica
 
Jim Hemmington, Head of Procurement at BBC - Developing and Implementing a St...
Jim Hemmington, Head of Procurement at BBC - Developing and Implementing a St...Jim Hemmington, Head of Procurement at BBC - Developing and Implementing a St...
Jim Hemmington, Head of Procurement at BBC - Developing and Implementing a St...
 
Failures in fpd/ orthodontics courses in india
Failures in fpd/ orthodontics courses in indiaFailures in fpd/ orthodontics courses in india
Failures in fpd/ orthodontics courses in india
 
500 important spoken tamil situations into spoken english sentences sample
500 important spoken tamil situations into spoken english sentences   sample500 important spoken tamil situations into spoken english sentences   sample
500 important spoken tamil situations into spoken english sentences sample
 
JKUAT Degree Cert
JKUAT Degree CertJKUAT Degree Cert
JKUAT Degree Cert
 
Bachelors of Commerce Degree Final Certificate
Bachelors of Commerce Degree Final CertificateBachelors of Commerce Degree Final Certificate
Bachelors of Commerce Degree Final Certificate
 
Real-Time Supply Chain Analytics with Machine Learning, Kafka, and Spark
Real-Time Supply Chain Analytics with Machine Learning, Kafka, and SparkReal-Time Supply Chain Analytics with Machine Learning, Kafka, and Spark
Real-Time Supply Chain Analytics with Machine Learning, Kafka, and Spark
 
Learning strategies
Learning strategiesLearning strategies
Learning strategies
 
Basic components of computer system
Basic components  of computer systemBasic components  of computer system
Basic components of computer system
 

Ähnlich wie "MapReduce: Simplified Data Processing on Large Clusters" Paper Presentation @ Papers We Love Bucharest

Intermachine Parallelism
Intermachine ParallelismIntermachine Parallelism
Intermachine Parallelism
Sri Prasanna
 
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and Cassandra
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and CassandraBrief introduction on Hadoop,Dremel, Pig, FlumeJava and Cassandra
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and Cassandra
Somnath Mazumdar
 
Hadoop fault tolerance
Hadoop  fault toleranceHadoop  fault tolerance
Hadoop fault tolerance
Pallav Jha
 
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop ClustersHDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
Xiao Qin
 
Map reduceoriginalpaper mandatoryreading
Map reduceoriginalpaper mandatoryreadingMap reduceoriginalpaper mandatoryreading
Map reduceoriginalpaper mandatoryreading
coolmirza143
 
Map reduce
Map reduceMap reduce
Map reduce
xydii
 
Behm Shah Pagerank
Behm Shah PagerankBehm Shah Pagerank
Behm Shah Pagerank
gothicane
 

Ähnlich wie "MapReduce: Simplified Data Processing on Large Clusters" Paper Presentation @ Papers We Love Bucharest (20)

MapReduce basics
MapReduce basicsMapReduce basics
MapReduce basics
 
Hadoop & MapReduce
Hadoop & MapReduceHadoop & MapReduce
Hadoop & MapReduce
 
MAP REDUCE IN DATA SCIENCE.pptx
MAP REDUCE IN DATA SCIENCE.pptxMAP REDUCE IN DATA SCIENCE.pptx
MAP REDUCE IN DATA SCIENCE.pptx
 
Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
 
Unit3 MapReduce
Unit3 MapReduceUnit3 MapReduce
Unit3 MapReduce
 
Beyond Map/Reduce: Getting Creative With Parallel Processing
Beyond Map/Reduce: Getting Creative With Parallel ProcessingBeyond Map/Reduce: Getting Creative With Parallel Processing
Beyond Map/Reduce: Getting Creative With Parallel Processing
 
Mapreduce Osdi04
Mapreduce Osdi04Mapreduce Osdi04
Mapreduce Osdi04
 
Intermachine Parallelism
Intermachine ParallelismIntermachine Parallelism
Intermachine Parallelism
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
 
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and Cassandra
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and CassandraBrief introduction on Hadoop,Dremel, Pig, FlumeJava and Cassandra
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and Cassandra
 
Hadoop fault tolerance
Hadoop  fault toleranceHadoop  fault tolerance
Hadoop fault tolerance
 
Apache Spark: What? Why? When?
Apache Spark: What? Why? When?Apache Spark: What? Why? When?
Apache Spark: What? Why? When?
 
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop ClustersHDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
 
Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...
Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...
Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...
 
An Introduction to MapReduce
An Introduction to MapReduce An Introduction to MapReduce
An Introduction to MapReduce
 
Lecture 1 mapreduce
Lecture 1  mapreduceLecture 1  mapreduce
Lecture 1 mapreduce
 
Map reduceoriginalpaper mandatoryreading
Map reduceoriginalpaper mandatoryreadingMap reduceoriginalpaper mandatoryreading
Map reduceoriginalpaper mandatoryreading
 
Map reduce
Map reduceMap reduce
Map reduce
 
Behm Shah Pagerank
Behm Shah PagerankBehm Shah Pagerank
Behm Shah Pagerank
 

Kürzlich hochgeladen

+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
Health
 
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
anilsa9823
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
mohitmore19
 

Kürzlich hochgeladen (20)

5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Models
 
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptx
 
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
 
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
 
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AISyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
 
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTV
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.com
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial Goals
 
Diamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionDiamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with Precision
 
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview Questions
 
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
 

"MapReduce: Simplified Data Processing on Large Clusters" Paper Presentation @ Papers We Love Bucharest

  • 1. MapReduce: Simplified Data Processing on Large Clusters Papers We Love Bucharest Chapter October 12, 2015
  • 2. I am Adrian Florea Architect Lead @ IBM Bucharest Software Lab Hello!
  • 3. What is? ◉MapReduce is a programming model and implementation for processing large data sets ◉Users specify a Map function that processes a key-value pair to generate a set of intermediate key-value pairs and a Reduce function that aggregates all intermediate values that share the same intermediate key in order to combine the derived data appropriately map:(k1, v1)->[(k2, v2)] reduce:(k2, [v2])->[(k3, v3)]
  • 4. Map ◉function (mathematics) ◉map (Java) ◉Select (.NET) Where else have we seen this? Reduce ◉fold (functional programming) ◉reduce (Java) ◉Aggregate (.NET)
  • 5. Other analogies "To draw an analogy to SQL, map is like the group-by clause of an aggregate query. Reduce is analogous to the aggregate function that is computed over all the rows with the same group- by attribute" D.J. DeWitt & M. Stonebraker Divide-and-conquer algorithms “recursively breaking down a problem into two or more sub-problems of the same (or related) type (divide), until these become simple enough to be solved directly (conquer). The solutions to the sub-problems are then combined to give a solution to the original problem.”
  • 6. Jeffrey Dean Google Fellow December 6, 2004 4PM Sanjay Ghemawat Google Fellow
  • 7. History ◉April 1960: John McCarthy introduced the concept of “maplist” ◉September 4, 1998: Google founded ◉1998-2003: hundreds of special-purpose large data computation programs in Google ◉February 2003: 1st version of MapReduce ◉August 2003: MapReduce significant enhancements ◉June 18, 2004: Patent US7650331 B1 filed ◉December 6, 2004: 1st MapReduce public presentation ◉2005: Hadoop implementation started in Java (Douglass R. Cutting & Michael J. Cafarella) ◉September 4, 2007: Hadoop 0.14.1 ◉January 19, 2010: Patent US7650331 B1 published ◉July 6, 2015: Hadoop 2.7.1
  • 8. Distribution issues ◉Communication and routing which nodes should be involved? what transport protocol should be used? threads/events/connections management remote execution of your processing code? ◉Fault tolerance and fault detection ◉Load balancing / partitioning of data heterogeneity of nodes skew in data network topology ◉Parallelization strategy algorithmic issues of work splitting “without having to deal with failures, the rest of the support code just isn’t so complicate” S. Ghemawat
  • 9. MapReduce model Map Operation Reduce Operation Input Data Intermediate Data Output Data
  • 10. application-independent Map Module application-independent Reduce Module MapReduce system application-specific Map Operation application-specific Reduce OperationInput Data Intermediate Data Output Data
  • 13. MapReduce model in practice map(String key, String value): for each Word w in value: EmitIntermediate(w, "1"); reduce(String key, Iterator values) int result = 0; for each v in values: result += ParseInt(v); Emit(key, AsString(result)); map:(k1, v1)->[(k2, v2)] reduce:(k2, [v2])->[(k3, v3)]
  • 14. How long does it take to go through 1 TB? Sequentially: 3 hours MapReduce: startup overhead 70 seconds + computation 80 seconds Environment ◉ 1800 Linux dual-processorx86 machines, 2-4 GB memory ◉ Fast Ethernet/Giga Ethernet ◉ Inexpensive IDE disks and a distributed Google File System
  • 15. Take a walk through a Google data center
  • 16. Tianhe-2, #1 supercomputer: 3,120,000 cores, 1,5 PB total memory
  • 17. Execution diagram ◉Master process is a task itself initiated by the WQM and is responsible for assigning all other tasks to worker processes ◉Each worker invokes at least a map thread and a reduce thread ◉If a worker fails, its tasks are reassigned to another worker process ◉When WQM receives a job, it allocates the job to the master that calculates and requires M+R+1 processes to be allocated to the job ◉WQM responds with the process allocation info (can result less processes) to the master that will manage the performance of the job ◉Reduce tasks begin work when the master informs them that there are intermediate files ready ◉Input data (files/DB/memory) are splitted in data blocks (16-64 MB) automatically or configurable ◉The worker to which a map task has been assigned applies the map() operator to the respective input data block ◉When the worker completes the task, it informs the master of the status ◉Master informs workers where to find intermediate data and schedules their reading ◉Workers (3 & 4) sort the intermediate key-value pairs, then merge (by applying reduce()) them and write to output
  • 18. Workflow diagram ◉When a process completes a task it informs WQM which updates the status tables ◉When WQM discovers one process failed, it assign its tasks to a new process and updates the status tables Task Status Table ◉TaskID ◉Status (InProgress, Waiting, Completed, Failed) ◉ProcessID ◉InputFiles (Input, Intermediate) ◉OutputFiles Process Status Table ◉ProcessID ◉Status (Idle, Busy, Failed) ◉Location (CPU ID, etc.) ◉Current (TaskID, WQM)
  • 19. Questions from the audience @ original paper presentation ◉Q: Wanted to know of any task that could not be handled using MapReduce? A: join operations could not be performed with the current model ◉Q: Wondered how MapReduce differs from parallel databases? A: MapReduce is stored across a large number of machines as compared to parallel databases, the abstractions are fairly simple to use in MapReduce, and MapReduce also benefits greatly from locality optimizations
  • 20. Bibliography ◉ J. Dean, S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters", OSDI'04, Dec. 6, 2004 ◉ J. Dean, S. Ghemawat, “System and method for efficient large-scale data processing”, Patent US 7650331 B1, Jan. 19, 2010 ◉ S. Ghemawat, J. Dean, J. Zhao, M. Austern, A. Spector, "Google Technology RoundTable: Map Reduce", Aug. 21, 2008 – Youtube ◉ P. Mahadevan, "OSDI'04 Conference Reports", ;LOGIN: Vol. 30, No. 2, Apr. 2005, p. 61 ◉ R. Jacotin, “Lecture: The Google MapReduce”, SlideShare, October 3, 2014
  • 21. Any questions ? You can find me at Thanks!