Map/Reduce


   Based on original presentation by
Christophe Bisciglia, Aaron Kimball, and
        Sierra Michels-Slettvet

  Except as otherwise noted, the content
 of this presentation is licensed under the
Creative Commons Attribution 2.5 License.
Functional Programming Review

   Functional operations do not modify data structures: They
    always create new ones
   Original data still exists in unmodified form
   Data flows are implicit in program design
   Order of operations does not matter
Functional Programming Review

fun foo(l: int list) =
 sum(l) + mul(l) + length(l)

 Order of sum() and mul(), etc. does not matter – they do not
  modify l
Functional Updates Do Not Modify Structures

fun append(x, lst) =
 let lst' = reverse lst in
   reverse ( x :: lst' )


  The append() function above reverses the list, conses the new
   element onto the front of the reversed list, and reverses the
   result: the net effect is that x ends up appended to the end.


  But it never modifies lst!
Functions Can Be Used As Arguments

fun DoDouble(f, x) = f (f x)

It does not matter what f does to its
argument; DoDouble() will do it twice.


What is the type of this function?
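An answer sketch (ours, not from the original deck): f is applied to its own result, so its argument and result types must agree. SML infers the type shown in the comment below.

fun DoDouble(f, x) = f (f x)
(* val DoDouble = fn : ('a -> 'a) * 'a -> 'a *)

val result = DoDouble ((fn n => n * 2), 2)   (* (2*2)*2 = 8 *)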
Map
map f lst: ('a->'b) -> ('a list) -> ('b list)
 Creates a new list by applying f to each element of the input list; returns
  output in order.

[Diagram: f is applied to each input element in turn, producing the output list in order]
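A small usage example (ours, not part of the original slides):

map (fn x => x * x) [1, 2, 3, 4];
(* => [1, 4, 9, 16] : int list *)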
Fold

fold f x0 lst: ('a*'b->'b)->'b->('a list)->'b
  Moves across a list, applying f to each element plus an accumulator. f
   returns the next accumulator value, which is combined with the next
   element of the list




[Diagram: starting from the initial accumulator, f is applied to each element in turn; the final accumulator value is returned]
fold left vs. fold right

   Order of list elements can be significant
   Fold left moves left-to-right across the list
   Fold right moves from right-to-left


SML Implementation:

fun foldl f a []      = a
  | foldl f a (x::xs) = foldl f (f(x, a)) xs

fun foldr f a []      = a
  | foldr f a (x::xs) = f(x, (foldr f a xs))
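A short illustration (ours) of why direction matters: with cons as the combining function, foldl reverses the list while foldr rebuilds it in order.

foldl (fn (x, a) => x :: a) [] [1, 2, 3];   (* => [3, 2, 1] *)
foldr (fn (x, a) => x :: a) [] [1, 2, 3];   (* => [1, 2, 3] *)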
Example

fun foo(l: int list) =
 sum(l) + mul(l) + length(l)

How can we implement this?
Example (Solved)

fun foo(l: int list) =
 sum(l) + mul(l) + length(l)

fun sum(lst) = foldl (fn (x,a)=>x+a) 0 lst
fun mul(lst) = foldl (fn (x,a)=>x*a) 1 lst
fun length(lst) = foldl (fn (x,a)=>1+a) 0 lst
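A quick sanity check (ours), assuming sum, mul, and length are declared before foo:

foo [1, 2, 3, 4];   (* sum = 10, mul = 24, length = 4  =>  38 *)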
A More Complicated Fold Problem

   Given a list of numbers, how can we generate a list of partial
    sums?

e.g.: [1, 4, 8, 3, 7, 9] → [0, 1, 5, 13, 16, 23, 32]
A More Complicated Fold Problem

   Given a list of numbers, how can we generate a list of partial
    sums?

e.g.: [1, 4, 8, 3, 7, 9] → [0, 1, 5, 13, 16, 23, 32]
fun partialsum(lst) = foldl (fn (x, a) => a @ [List.last a + x]) [0] lst
A More Complicated Map Problem

   Given a list of words, can we: reverse the letters in each word,
    and reverse the whole list, so it all comes out backwards?

[“my”, “happy”, “cat”] -> [“tac”, “yppah”, “ym”]
A More Complicated Map Problem

   Given a list of words, can we: reverse the letters in each word,
    and reverse the whole list, so it all comes out backwards?

[“my”, “happy”, “cat”] -> [“tac”, “yppah”, “ym”]
fun reverseword(w) = implode (rev (explode w))
fun reverse2(lst) = foldr (fn (x, a) => a @ [reverseword(x)]) [] lst
map Implementation
fun map f []      = []
  | map f (x::xs) = (f x) :: (map f xs)


   This implementation moves left-to-right across the list,
    mapping elements one at a time

   … But does it need to?
Implicit Parallelism In Map

   In a purely functional setting, elements of a list being computed by map
    cannot see the effects of the computations on other elements
   If the order in which f is applied to the list elements does not affect
     the result, we can reorder or parallelize execution
   This is the insight behind MapReduce
Motivation: Large Scale Data Processing

   Want to process lots of data ( > 1 TB)
   Want to parallelize across hundreds/thousands of CPUs
   … Want to make this easy
    • Hide the details of parallelism, machine management, fault
      tolerance, etc.
Sample Applications
   Distributed Grep
   Count of URL Access Frequency
   Reverse Web-Link Graph
   Inverted Index
   Distributed Sort
MapReduce

   Automatic parallelization & distribution
   Fault-tolerant
   Provides status and monitoring tools
   Clean abstraction for programmers
Programming Model

   Borrows from functional programming
   Users implement interface of two functions:

    • map (in_key, in_value) ->
       (out_key, intermediate_value) list

    • reduce (out_key, intermediate_value list) ->
       out_value list
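One way (ours; the real library is C++) to write these two shapes as SML types:

(* map: input key/value in, list of intermediate key/value pairs out *)
type ('in_k, 'in_v, 'out_k, 'mid_v) mapper = 'in_k * 'in_v -> ('out_k * 'mid_v) list

(* reduce: an output key plus all of its intermediate values in, final values out *)
type ('out_k, 'mid_v, 'out_v) reducer = 'out_k * 'mid_v list -> 'out_v list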
Map

   Records from the data source (lines out of files, rows of a
    database, etc) are fed into the map function as key*value pairs:
    e.g., (filename, line)
   map() produces one or more intermediate values along with an
    output key from the input
   Buffers intermediate values in memory before periodically
    writing to local disk
   Writes are split into R regions based on intermediate key value
    (e.g., hash(key) mod R)
     • Locations of regions communicated back to master who informs
       reduce tasks of all appropriate disk locations
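A toy illustration (ours) of the hash(key) mod R partitioning mentioned above; the hash function is a stand-in, not the one the real system uses:

fun hashString s =
    foldl (fn (c, h) => (h * 31 + Char.ord c) mod 1000003) 0 (explode s)

(* which of the R reduce regions a given intermediate key lands in *)
fun partition R key = hashString key mod R   (* e.g., partition 4 "apple" *)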
Reduce

   After the map phase is over, all the intermediate values for a
    given output key are combined together into a list
      • RPC to the map workers' local disks to gather all the keys for a given region
     • Sort all keys since the same key can in general come from
       multiple map processes
   reduce() combines those intermediate values into one or more
    final values for that same output key
   Optional combine() phase as an optimization
[Diagram: input key*value pairs from data store 1 through data store n are fed to map tasks, each of which emits (key 1, values ...), (key 2, values ...), (key 3, values ...) pairs. A barrier aggregates the intermediate values by output key; each reduce task then receives (key k, intermediate values) for one key and produces the final values for that key.]
Parallelism

   map() functions run in parallel, creating different intermediate
    values from different input data sets
   reduce() functions also run in parallel, each working on a
    different output key
   All values are processed independently
   Bottleneck: reduce phase cannot start until map phase
    completes
Example: Count word occurrences
map(String input_key, String input_value):
  // input_key: document name
  // input_value: document contents
  for each word w in input_value:
    EmitIntermediate(w, "1");

reduce(String output_key, Iterator intermediate_values):
  // output_key: a word
  // output_values: a list of counts
  int result = 0;
  for each v in intermediate_values:
    result += ParseInt(v);
  Emit(AsString(result));
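The same word count, sketched (by us) as ordinary single-machine SML. The insert/fold grouping step stands in for the shuffle the real system performs across the network; all names (wcMap, wcReduce, wordCount) are ours.

fun wcMap (docName : string, contents : string) : (string * int) list =
    map (fn w => (w, 1)) (String.tokens Char.isSpace contents)

fun wcReduce (word : string, counts : int list) : int list =
    [foldl (op +) 0 counts]

(* stand-in for the shuffle/sort: gather each key's values into one list *)
fun insert ((k, v), []) = [(k, [v])]
  | insert ((k, v), (k', vs) :: rest) =
      if k = k' then (k', v :: vs) :: rest
      else (k', vs) :: insert ((k, v), rest)

fun wordCount docs =
    map (fn (k, vs) => (k, wcReduce (k, vs)))
        (foldl insert [] (List.concat (map wcMap docs)))

(* wordCount [("d1", "the cat sat"), ("d2", "the cat")]
   => [("the", [2]), ("cat", [2]), ("sat", [1])] *)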
Example vs. Actual Source Code

   Example is written in pseudo-code
   Actual implementation is in C++, using a MapReduce library
   Bindings for Python and Java exist via interfaces
   True code is somewhat more involved (defines how the input
    key/values are divided up and accessed, etc.)
Implementation
Locality

   Master program divides up tasks based on location of data:
    tries to have map() tasks on same machine as physical file data
    • Failing that, on the same switch where bandwidth is relatively
      plentiful
    • Datacenter communications architecture?
   map() task inputs are divided into 64 MB blocks: same size as
    Google File System chunks
Fault Tolerance

   Master detects worker failures
    • Re-executes completed & in-progress map() tasks
    • Re-executes in-progress reduce() tasks
    • Importance of deterministic operations
   Data written to temporary files by both map() and reduce()
    • Upon successful completion, map() tells master of file names
        Master ignores if already heard from another map on same task
    • Upon successful completion, reduce() atomically renames file
   Master notices particular input key/values cause crashes in
    map(), and skips those values on re-execution.
    • Effect: Can work around bugs in third-party libraries
Optimizations

   No reduce can start until map is complete:
     • A single slow disk controller can rate-limit the whole process
   Master redundantly executes “slow-moving” map tasks; uses
    results of first copy to finish




    Why is it safe to redundantly execute map tasks? Wouldn’t this mess up
    the total computation?
Optimizations

   “Combiner” functions can run on same machine as a mapper
   Causes a mini-reduce phase to occur before the real reduce
    phase, to save bandwidth
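For the word-count example, a combiner can simply pre-sum the "1"s emitted on one machine before anything crosses the network. A sketch (ours), with the same shape as the reduce step:

fun wcCombine (word : string, counts : int list) : (string * int) list =
    [(word, foldl (op +) 0 counts)]

The reduce phase then receives one partial count per word per mapper instead of one "1" per occurrence.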
Performance Evaluation
   1800 Machines
    • Gigabit Ethernet Switches
     • Two-level tree hierarchy, 100-200 Gbps at root
     • 4 GB RAM, dual 2 GHz Intel processors
     • Two 160 GB drives
    Grep: 10^10 100-byte records (1 TB)
     • Search for a relatively rare 3-character sequence
    Sort: 10^10 100-byte records (1 TB)
Grep Data Transfer Rate
Sort: Normal Execution
Sort: No Backup Tasks
MapReduce Conclusions

   MapReduce has proven to be a useful abstraction
   Greatly simplifies large-scale computations at Google
   Functional programming paradigm can be applied to large-scale
    applications
   Focus on problem, let library deal w/ messy details
