VanPyz, June 2, 2009

Introduction to MapReduce using Disco

Erlang and Python

by @JimRoepcke
Computing at Google Scale

Image source: http://ischool.tv/news/files/2006/12/computer-grid02s.jpg

Massive databases and data streams need to be processed quickly and reliably

Thousands of commodity PCs available in Google’s cluster for computations

Faults are statistically “guaranteed” to occur
Google’s Motivation

Google has thousands of programs that process user-generated data.

Even simple computations were being obscured by the complex code required to run efficiently and reliably on their clusters.

Engineers shouldn’t have to be experts in distributed systems to write scalable data-processing software.
Why not just use threads?

Threads only add concurrency, and only within a single node

They do not scale beyond one node to a cluster or a cloud

Coordinating work between nodes requires distribution middleware

MapReduce is distribution middleware

MapReduce scales roughly linearly with the number of cores and nodes
Hadoop

An Apache Software Foundation project

Written in Java

Includes the Hadoop Distributed File System (HDFS)
Disco

Created by Ville Tuulos at the Nokia Research Center

Written in Erlang and Python

Does not include a distributed file system

  You provide your own data distribution mechanism
How MapReduce works
The big scary diagram...

Source: http://labs.google.com/papers/mapreduce-osdi04.pdf

[Figure 1: Execution overview. (1) The user program forks a master and workers. (2) The master assigns map tasks and reduce tasks. (3) Map workers read their input splits and (4) write intermediate files to their local disks. (5) Reduce workers read those files remotely and (6) write the final output files.]
It’s truly very simple...

Master splits input

The (typically huge) input is split into chunks

  One or more for each “map worker”
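To make the idea concrete, here is a minimal sketch of splitting; the name split_input and the 64 MB chunk size are illustrative (the Google paper uses splits of 16-64 MB), and a real splitter would respect record boundaries:

    def split_input(path, chunk_size=64 * 1024 * 1024):
        """Yield fixed-size chunks of the file at `path`.

        Simplified: a real splitter would not cut records in half.
        """
        with open(path, "rb") as f:
            while True:
                chunk = f.read(chunk_size)
                if not chunk:
                    break
                yield chunk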
Splits fed to map workers

The master tells each map worker which split(s) it will process

  A split is a file containing some number of input records

  Each record has a key and its associated value
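Assuming, purely for illustration, that a split stores one record per line as a tab-separated key and value, a map worker could read it back like this:

    def read_records(split_path):
        """Yield (key, value) records from a split file, one per line."""
        with open(split_path) as f:
            for line in f:
                key, _, value = line.rstrip("\n").partition("\t")
                yield key, value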
Map each input

The map worker executes your problem-specific map algorithm

  Called once for each record in its input
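The map worker itself is little more than a loop that calls your map function once per record. A sketch, building on the hypothetical read_records above:

    def run_map_worker(split_path, map_fn):
        """Call the user's map function once per record in the split."""
        for key, value in read_records(split_path):
            # map_fn yields zero or more intermediate (K,V) pairs per record
            for pair in map_fn(key, value):
                yield pair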
Map emits (Key,Value) pairs

Your map algorithm emits zero or more intermediate key-value pairs for each record processed

  Let’s call these “(K,V) pairs” from now on

  Keys and values are both strings
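For the word-count job that closes this deck, such a map function could look like this sketch, emitting one ("word", "1") pair per word occurrence:

    def count_words_map(key, value):
        """key: document name (ignored); value: the document's text."""
        for word in value.split():
            yield word, "1"  # keys and values are both strings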
(K,V) pairs hashed to buckets

Each map worker has its own set of buckets

Each (K,V) pair is placed into one of these buckets

Which bucket is determined by a hash function of the key

Advanced: if you know the distribution of your intermediate keys is skewed, provide a custom hash function that distributes (K,V) pairs evenly
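A default partitioner is usually just a stable hash of the key modulo the number of buckets. A sketch, using zlib.crc32 rather than Python's built-in hash(), because hash() is salted per process in modern Python and every map worker must agree on the bucketing:

    import zlib

    def default_partition(key, num_buckets):
        """Same key -> same bucket, in every worker process."""
        return zlib.crc32(key.encode("utf-8")) % num_buckets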
Buckets sent to reducers

Once all map workers are finished, corresponding buckets of (K,V) pairs are sent to the reduce workers

Example: each map worker placed (K,V) pairs into its own buckets A, B, and C.

  Send bucket A from each map worker to reduce worker 1;
  send bucket B from each map worker to reduce worker 2;
  send bucket C from each map worker to reduce worker 3.
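Put differently, reduce worker i receives bucket i from every map worker. A sketch, where map_outputs[m][r] is map worker m's bucket destined for reduce worker r:

    def shuffle(map_outputs, num_reducers):
        """Gather, for each reduce worker, its bucket from every map worker."""
        return [
            [pair for buckets in map_outputs for pair in buckets[r]]
            for r in range(num_reducers)
        ]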
Reduce inputs sorted

The reduce worker first concatenates the buckets it received into one file

Then the file of (K,V) pairs is sorted by K

  Now the (K,V) pairs are grouped by key

This sorted list of (K,V) pairs is the input to the reduce worker
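Sort-then-group is exactly what Python's itertools.groupby provides once the pairs are sorted by key:

    from itertools import groupby
    from operator import itemgetter

    def group_pairs(pairs):
        """Sort (K,V) pairs by key, then yield one (key, values) group per key."""
        ordered = sorted(pairs, key=itemgetter(0))
        for key, group in groupby(ordered, key=itemgetter(0)):
            yield key, [v for _, v in group]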
Reduce the list of (K,V) pairs

The reduce worker executes your problem-specific reduce algorithm

  Called once for each key in its input

  Writes whatever it wants to its output file
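So the reduce worker is another small loop. A sketch using group_pairs from above; the convention that reduce_fn returns one output line per key is illustrative, not required:

    def run_reduce_worker(pairs, reduce_fn, out_path):
        """Call the user's reduce function once per key; write its output."""
        with open(out_path, "w") as out:
            for key, values in group_pairs(pairs):
                out.write(reduce_fn(key, values) + "\n")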
Output

The output of the MapReduce job is the set of output files generated by the reduce workers

What you do with this output is up to you

You might use this output as the input to another MapReduce job
Modified from source: http://labs.google.com/papers/mapreduce-osdi04.pdf

Example: Counting words

    def map(key, value):
        # key: document name (ignored)
        # value: words in document (list)
        for word in value:
            EmitIntermediate(word, "1")

    def reduce(key, values):
        # key: a word
        # values: a list of counts
        result = 0
        for v in values:
            result += int(v)
        print key, result
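Note that EmitIntermediate is pseudocode from the Google paper, not real Python. Stitching the earlier sketches together gives a tiny single-process simulation of the whole pipeline, handy for sanity-checking the word-count logic before any framework is involved (illustrative only; the in-process hash() is fine here because there is only one process):

    from itertools import groupby
    from operator import itemgetter

    def map_fn(key, value):
        for word in value.split():
            yield word, "1"

    def reduce_fn(key, values):
        return key, sum(int(v) for v in values)

    def mapreduce(inputs, map_fn, reduce_fn, num_reducers=3):
        # Map phase: route each emitted (K,V) pair to a bucket by key.
        buckets = [[] for _ in range(num_reducers)]
        for name, text in inputs:
            for k, v in map_fn(name, text):
                buckets[hash(k) % num_reducers].append((k, v))
        # Reduce phase: sort each bucket, group by key, reduce each group.
        results = []
        for bucket in buckets:
            for k, group in groupby(sorted(bucket), key=itemgetter(0)):
                results.append(reduce_fn(k, [v for _, v in group]))
        return results

    if __name__ == "__main__":
        docs = [("a.txt", "to be or not to be"),
                ("b.txt", "to map is to reduce")]
        for word, count in sorted(mapreduce(docs, map_fn, reduce_fn)):
            print(word, count)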
Stand up! Let’s do it!

Organize yourselves into approximately equal numbers of map and reduce workers

I’ll be the master
Disco demonstration

I wanted to demonstrate a cool puzzle solver.

No go, but I can show the code. It’s really simple!

Instead you get count_words again, but scaled way up!

    python count_words.py disco://localhost
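For reference, a count_words job in the style of Disco's circa-2009 tutorial looked roughly like the sketch below. Treat the exact signatures and the input URL as assumptions; this is Python 2 era code and the Disco API changed between versions:

    import sys
    from disco.core import Disco, result_iterator  # assumed circa-2009 API

    def fun_map(e, params):
        # e: one line of input text
        return [(w, 1) for w in e.split()]

    def fun_reduce(iter, out, params):
        counts = {}
        for word, count in iter:
            counts[word] = counts.get(word, 0) + int(count)
        for word, total in counts.items():
            out.add(word, total)

    master = sys.argv[1]  # e.g. disco://localhost
    results = Disco(master).new_job(
        name="count_words",
        input=["http://example.com/big_text_file.txt"],  # placeholder input
        map=fun_map,
        reduce=fun_reduce).wait()

    for word, count in result_iterator(results):
        print word, count  # Python 2 print statement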
