Map & Reduce
  Christopher Schleiden, Christian Corsten, Michael Lottko, Jinhui Li
  The slides are licensed under a Creative Commons Attribution 3.0 License
Outline
  Motivation
  Concept
  Parallel Map & Reduce
  Google’s MapReduce
  Example: Word Count
  Demo: Hadoop
  Summary
  [Footer: Web Technologies]
Today the web is all about data!
  Google: processing of 20 PB/day (2008)
  LHC: will generate about 15 PB/year
  Facebook: 2.5 PB of data + 15 TB/day (4/2009)
  BUT: it takes ~2.5 hours to read one terabyte off a typical hard disk!
Solution: Going Parallel!
  Data distribution
  However, parallel programming is hard!
    Synchronization
    Load balancing
    …
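The motivation above can be checked with a little arithmetic. The sketch below derives the disk throughput implied by "2.5 hours per terabyte" and shows how the read time shrinks when the data is spread over several disks. It assumes decimal units (1 TB = 1,000,000 MB) and an idealized, perfectly even split with no coordination overhead; the helper name `parallel_read_hours` is made up for illustration.

```python
# Assumption: decimal units, 1 TB = 1,000,000 MB.
TERABYTE_MB = 1_000_000
READ_HOURS_SINGLE_DISK = 2.5  # figure quoted on the slide

# Sequential throughput implied by the slide's figure, in MB/s.
throughput_mb_s = TERABYTE_MB / (READ_HOURS_SINGLE_DISK * 3600)


def parallel_read_hours(disks):
    """Idealized read time when one terabyte is split evenly over `disks` disks."""
    return READ_HOURS_SINGLE_DISK / disks


print(f"implied throughput: {throughput_mb_s:.0f} MB/s")
print(f"100 disks: {parallel_read_hours(100) * 60:.1f} minutes")
```

Real clusters will not reach this ideal, since distribution, synchronization, and load balancing all cost time.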
Map & Reduce
  Programming model and framework
  Designed for processing large volumes of data in parallel
  Based on the functional map and reduce concept
    i.e., the output of a function depends only on its input; there are no side effects
Functional Concept
  Map: apply a function to each value of a sequence
    map(k, v) → <k’, v’>*
  Reduce/Fold: combine all elements of a sequence using a binary operator
    reduce(k’, <v’>*) → <k’, v’>*
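As a small illustration of the functional concept, Python's built-in `map` and `functools.reduce` behave this way: `map` applies a side-effect-free function to each element, and `reduce` folds the sequence with a binary operator. The input values are arbitrary.

```python
from functools import reduce
from operator import add

values = [1, 2, 3, 4]

# Map: apply a pure function to each value of the sequence.
squares = list(map(lambda x: x * x, values))

# Reduce/Fold: combine all elements using a binary operator (here: addition).
total = reduce(add, squares, 0)

print(squares, total)
```

Because neither function touches shared state, the per-element calls in the map step could run in any order or on different machines.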
Typical problem
  Iterate over a large number of records
  Extract something interesting (Map)
  Shuffle & sort intermediate results
  Aggregate intermediate results (Reduce)
  Write final output
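The five steps can be sketched as one small, sequential driver. This is only an illustration of the data flow, not how a distributed runtime is built; the function name `map_reduce` and the log-line example are hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def map_reduce(records, mapper, reducer):
    """Run map, shuffle & sort, and reduce over an in-memory list of records."""
    # Map: extract (key, value) pairs from every record.
    intermediate = [pair for record in records for pair in mapper(record)]
    # Shuffle & sort: bring pairs with the same key next to each other.
    intermediate.sort(key=itemgetter(0))
    # Reduce: aggregate the values collected for each key.
    return [
        reducer(key, [value for _, value in group])
        for key, group in groupby(intermediate, key=itemgetter(0))
    ]


# Hypothetical example: longest line length per log level.
records = ["INFO start", "WARN disk almost full", "INFO done"]
result = map_reduce(
    records,
    mapper=lambda line: [(line.split()[0], len(line))],
    reducer=lambda level, lengths: (level, max(lengths)),
)
print(result)
```

Only `mapper` and `reducer` are problem-specific; the surrounding steps stay the same for every job, which is what lets a framework take them over.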
Parallel Map & Reduce
Parallel Map & Reduce
  Published (2004) and patented (2010) by Google Inc.
  C++ runtime with bindings to Java/Python
  Other implementations:
    Apache Hadoop/Hive project (Java)
      Developed at Yahoo!
      Used by: Facebook, Hulu, IBM, and many more
    Microsoft COSMOS (Scope, based on SQL and C#)
    Starfish (Ruby)
    …
Parallel Map & Reduce /2
  Parallel execution of Map and Reduce stages
  Scheduling through Master/Worker pattern
  Runtime handles:
    Assigning workers to map and reduce tasks
    Data distribution
    Detecting crashed workers
Parallel Map & Reduce Execution
  [Figure: execution stages Input (DATA), Map, Shuffle & Sort, Reduce, Output (RESULT)]
Components in Google’s MapReduce
Google Filesystem (GFS)
  Stores…
    Input data
    Intermediate results
    Final results
  …in 64 MB chunks on at least three different machines
  [Figure labels: File, Nodes]
Scheduling (Master/Worker)
  One master, many workers
  Input data split into M map tasks (~64 MB in size; GFS)
  Reduce phase partitioned into R tasks
  Tasks are assigned to workers dynamically
    Master assigns each map task to a free worker
    Master assigns each reduce task to a free worker
  Fault handling via redundancy
    Master checks via heartbeat whether each worker is still alive
    Reschedules work item if worker has died
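The assignment and rescheduling behaviour can be sketched as a toy, single-process simulation. Everything here is hypothetical: the function `schedule`, the round-robin choice of worker, and the `dead_workers` set, which stands in for workers that stopped answering heartbeats. A real master tracks this over the network and over time.

```python
from collections import deque


def schedule(tasks, workers, dead_workers):
    """Assign every task to a live worker, rescheduling tasks of dead workers.

    `dead_workers` is a stand-in for missed heartbeats (an assumption of this sketch).
    """
    pending = deque(tasks)
    alive = list(workers)
    assignment = {}
    next_worker = 0
    while pending:
        if not alive:
            raise RuntimeError("no live workers left")
        task = pending.popleft()
        worker = alive[next_worker % len(alive)]
        if worker in dead_workers:
            # Heartbeat missed: drop the worker and put the task back in the queue.
            alive.remove(worker)
            pending.append(task)
            continue
        assignment[task] = worker
        next_worker += 1
    return assignment


tasks = ["map-0", "map-1", "map-2", "reduce-0"]
assignment = schedule(tasks, workers=["w1", "w2", "w3"], dead_workers={"w2"})
print(assignment)
```

In this run the task first handed to `w2` is put back in the queue and later completed by another worker, so every task ends up assigned exactly once.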
Scheduling Example
  [Figure: a Master assigns map and reduce tasks to Workers; stages Input (DATA), Map, Temp, Reduce, Output (RESULT)]
Google’s M&R vs. Hadoop
  Google MapReduce                Hadoop MapReduce
  Main language: C++              Main language: Java
  Google Filesystem (GFS)         Hadoop Filesystem (HDFS)
  GFS Master                      Hadoop namenode
  GFS chunkserver                 Hadoop datanode
Word Count
  The Map & Reduce “Hello World” example
Word Count - Input
  Set of text files:
    bar.txt: This is the bar file
    foo.txt: Sweet, this is the foo file
  Expected output: sweet (1), this (2), is (2), the (2), foo (1), bar (1), file (2)
Word Count - Map
  mapper(filename, file-contents):
    for each word in file-contents:
      emit(word, 1)
  Output: this (1), is (1), the (1), sweet (1), this (1), the (1), is (1), foo (1), bar (1), file (1), file (1)
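A minimal Python sketch of the mapper follows. The tokenization rule (lowercase everything and treat runs of letters as words) is an assumption, chosen so that "Sweet," and "This" yield the lowercase keys used on the slides.

```python
import re


def mapper(filename, contents):
    """Yield a (word, 1) pair for every word in the file contents."""
    # Assumption: a word is a run of letters; case and punctuation are ignored.
    for word in re.findall(r"[a-z]+", contents.lower()):
        yield (word, 1)


pairs = list(mapper("foo.txt", "Sweet, this is the foo file"))
print(pairs)
```

The mapper never counts anything itself; it only emits a 1 per occurrence and leaves the aggregation to the reduce step.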
Word Count - Shuffle & Sort
  Before: this (1), is (1), the (1), sweet (1), this (1), the (1), is (1), foo (1), bar (1), file (1), file (1)
  After: this (1) this (1), is (1) is (1), the (1) the (1), sweet (1), foo (1), bar (1), file (1) file (1)
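The grouping step can be sketched in a few lines of Python. The helper name `shuffle_and_sort` is made up, and sorting the keys alphabetically is one possible ordering, not necessarily the one drawn on the slide.

```python
from collections import defaultdict


def shuffle_and_sort(pairs):
    """Group the values of identical keys together and return the groups sorted by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


grouped = shuffle_and_sort([("this", 1), ("is", 1), ("the", 1), ("this", 1)])
print(grouped)
```

After this step each key appears exactly once, paired with all of its values, which is the input shape the reducer expects.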
Word Count - Reduce
  reducer(word, values):
    sum = 0
    for each value in values:
      sum = sum + value
    emit(word, sum)
  Output: sweet (1), this (2), is (2), the (2), foo (1), bar (1), file (2)
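The reducer pseudocode translates almost directly into Python. In this sketch the grouped input is written out by hand from the two example files rather than produced by a framework, and the reducer returns its pair instead of calling an `emit` function.

```python
def reducer(word, values):
    """Sum all counts collected for one word and return the (word, total) pair."""
    total = 0
    for value in values:
        total += value
    return (word, total)


# Grouped intermediate data for bar.txt and foo.txt, written out by hand.
grouped = {
    "sweet": [1], "this": [1, 1], "is": [1, 1], "the": [1, 1],
    "foo": [1], "bar": [1], "file": [1, 1],
}
counts = dict(reducer(word, values) for word, values in grouped.items())
print(counts)
```

Each call sees only the values of a single key, so calls for different words are independent and can be spread across the R reduce tasks.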
Demo: Hadoop Word Count
Summary
  Lots of data processed on the web (e.g., Google)
  Performance solution: go parallel
  Input, Map, Shuffle & Sort, Reduce, Output
  Google File System
  Scheduling: Master/Worker
  Word Count example
  Hadoop
  Questions?
References
  Inspirations for presentation:
    http://www4.informatik.uni-erlangen.de/Lehre/WS10/V_MW/Uebung/folien/05-Map-Reduce-Framework.pdf
    http://www.scribd.com/doc/23844299/Map-Reduce-Hadoop-Pig
    RWTH Map Reduce Talk: http://bit.ly/f5oM7p
  Papers:
    Dean et al., MapReduce: Simplified Data Processing on Large Clusters, OSDI'04: Sixth Symposium on Operating System Design and Implementation, San Francisco, CA, December 2004.
    Ghemawat et al., The Google File System, 19th ACM Symposium on Operating Systems Principles, Lake George, NY, October 2003.


Editor's notes

  1. These days the web is all about data. All major websites rely on huge amounts of data in some form in order to provide services to users. For example Google … and Facebook …. Facilities like the LHC will also produce data measured in petabytes each year. However, it takes about 2.5 hours to read one terabyte off a typical hard drive. The solution that comes immediately to mind, of course, is going parallel. Concrete example [TODO], [context: cloud computing]
  2. Parallel programming is still hard. Programmers have to deal with a lot of boilerplate code and have to manually write code for things like scheduling and load balancing. People also want to use the company cluster in parallel, so something like a batch system is needed. As more and more companies use huge amounts of data, a kind of standard framework or platform has emerged in recent years: the Map/Reduce framework.
  3. Map and Reduce have been known for years as functional programming concepts.
  4. Actual execution and scheduling
  5. http://www4.informatik.uni-erlangen.de/Lehre/WS10/V_MW/Uebung/folien/05-Map-Reduce-Framework.pdf