Map/Reduce


   Based on original presentation by
Christophe Bisciglia, Aaron Kimball, and
        Sierra Michels-Slettvet

  Except as otherwise noted, the content
 of this presentation is licensed under the
Creative Commons Attribution 2.5 License.
Functional Programming Review

   Functional operations do not modify data structures: They
    always create new ones
   Original data still exists in unmodified form
   Data flows are implicit in program design
   Order of operations does not matter
Functional Programming Review

fun foo(l: int list) =
 sum(l) + mul(l) + length(l)

 Order of sum() and mul(), etc does not matter – they do not
 modify l
Functional Updates Do Not Modify Structures

fun append(x, lst) =
 let val lst' = reverse lst in
   reverse ( x :: lst' )
 end


  The append() function above reverses a list, adds a new
  element to the front, and returns all of that, reversed,
  which appends an item.


  But it never modifies lst!
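A minimal Python sketch of the same idea, assuming ordinary Python lists stand in for the SML list. The name `append` mirrors the slide; the sample values are illustrative only.

```python
def append(x, lst):
    """Return a new list with x at the end; lst itself is left untouched."""
    reversed_copy = lst[::-1]            # slicing builds a new list
    return ([x] + reversed_copy)[::-1]   # put x in front, then reverse again


original = [1, 2, 3]
extended = append(4, original)
print(original)  # the input list is unchanged
print(extended)  # the new list ends with 4
```

Every step builds a fresh list, so any other code still holding `original` sees the same values as before.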
Functions Can Be Used As Arguments

fun DoDouble(f, x) = f (f x)

It does not matter what f does to its
argument; DoDouble() will do it twice.


What is the type of this function?
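One way to read the question: f is applied to its own result, so its input and output types must match. In SML that gives DoDouble the type ('a -> 'a) * 'a -> 'a. The Python sketch below spells out the same constraint with type hints; `do_double` and `add_three` are illustrative names, not part of the original slides.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def do_double(f: Callable[[T], T], x: T) -> T:
    """Apply f twice: the result of the first call feeds the second."""
    return f(f(x))


def add_three(n: int) -> int:
    return n + 3


result = do_double(add_three, 10)
print(result)
```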
Map
map f [] = []
map f (a:as) = f(a) : map f as
 Creates a new list by applying f to each element of the input list; returns
  output in order.

[Figure: f applied to each element of the input list, one output element per input element]
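The definition of map can be sketched recursively in Python. This is an illustration only, assuming Python lists as the list type; `my_map` is a hypothetical name chosen to avoid shadowing the built-in `map`.

```python
def my_map(f, items):
    """Recursive map: the empty list maps to the empty list."""
    if not items:
        return []
    head, *tail = items
    # Apply f to the first element, then map over the rest.
    return [f(head)] + my_map(f, tail)


squares = my_map(lambda n: n * n, [1, 2, 3, 4])
print(squares)
```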
Fold

fold f x0 lst: ('a*'b->'b)->'b->('a list)->'b
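  Moves across a list, applying f to each element plus an accumulator. f
   returns the next accumulator value, which is combined with the next
   element of the list

[Figure: the initial accumulator value enters the first application of f; each f passes its result to the next; the last result is returned]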
fold left vs. fold right

   Order of list elements can be significant
   Fold left moves left-to-right across the list
   Fold right moves from right-to-left


SML Implementation:

fun foldl f a []      = a
  | foldl f a (x::xs) = foldl f (f(x, a)) xs

fun foldr f a []      = a
  | foldr f a (x::xs) = f(x, (foldr f a xs))
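A Python sketch of the two folds, written with loops rather than recursion (an implementation choice made here, not taken from the slides). Using a non-commutative function such as `cons` shows why the direction matters; `cons` is an illustrative helper.

```python
def foldl(f, acc, items):
    """Left fold: consume items left to right, threading the accumulator."""
    for x in items:
        acc = f(x, acc)
    return acc


def foldr(f, acc, items):
    """Right fold: consume items right to left."""
    for x in reversed(items):
        acc = f(x, acc)
    return acc


def cons(x, acc):
    """Put x in front of the accumulated list (like SML's x :: acc)."""
    return [x] + acc


left = foldl(cons, [], [1, 2, 3])
right = foldr(cons, [], [1, 2, 3])
print(left, right)
```

With `cons`, the left fold reverses the list while the right fold rebuilds it in the original order.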
Example

fun foo(l: int list) =
 sum(l) + mul(l) + length(l)

How can we implement this?
Example (Solved)

fun foo(l: int list) =
 sum(l) + mul(l) + length(l)

fun sum(lst) = foldl (fn (x,a)=>x+a) 0 lst
fun mul(lst) = foldl (fn (x,a)=>x*a) 1 lst
fun length(lst) = foldl (fn (x,a)=>1+a) 0 lst
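The same three folds can be sketched in Python with `functools.reduce`, which passes the accumulator as the first argument of the folding function. The names `total` and `product` are chosen here to avoid shadowing Python built-ins.

```python
from functools import reduce


def total(lst):
    return reduce(lambda a, x: a + x, lst, 0)


def product(lst):
    return reduce(lambda a, x: a * x, lst, 1)


def length(lst):
    # The element itself is ignored; each one just adds 1 to the count.
    return reduce(lambda a, _x: a + 1, lst, 0)


def foo(lst):
    return total(lst) + product(lst) + length(lst)


answer = foo([1, 2, 3, 4])
print(answer)
```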
A More Complicated Fold Problem

   Given a list of numbers, how can we generate a list of partial
    sums?

e.g.: [1, 4, 8, 3, 7, 9] ->
    [0, 1, 5, 13, 16, 23, 32]
fun partialsum(lst) = foldl (fn (x, a) => a @ [last a + x]) [0] lst
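A Python sketch of the partial-sums fold, assuming the accumulator starts as the one-element list `[0]` so that the last entry is always the running total. `partial_sums` is an illustrative name.

```python
from functools import reduce


def partial_sums(lst):
    """Fold over lst, appending (last partial sum + x) at each step."""
    return reduce(lambda acc, x: acc + [acc[-1] + x], lst, [0])


sums = partial_sums([1, 4, 8, 3, 7, 9])
print(sums)
```

Each step copies the accumulator, so this version is quadratic in the list length; it is meant to mirror the fold, not to be efficient.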
A More Complicated Map Problem

   Given a list of words, can we: reverse the letters in each word,
    and reverse the whole list, so it all comes out backwards?

[“my”, “happy”, “cat”] -> [“tac”, “yppah”, “ym”]
fun reverse2(lst) = foldr (fn (x, a) => a @ [reverseword x]) [] lst
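The right fold can be sketched in Python as follows. `foldr` and `reverse_word` are helpers defined here for illustration; the slides assume a `reverseword` function without showing it.

```python
def foldr(f, acc, items):
    """Right fold: consume items right to left."""
    for x in reversed(items):
        acc = f(x, acc)
    return acc


def reverse_word(word):
    return word[::-1]


def reverse2(words):
    # The fold visits the last word first, so appending each reversed
    # word to the end of the accumulator reverses the whole list.
    return foldr(lambda x, acc: acc + [reverse_word(x)], [], words)


result = reverse2(["my", "happy", "cat"])
print(result)
```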
map Implementation
fun map f []      = []
  | map f (x::xs) = (f x) :: (map f xs)


   This implementation moves left-to-right across the list,
    mapping elements one at a time

   … But does it need to?
Implicit Parallelism In Map

   In a purely functional setting, elements of a list being computed by map
    cannot see the effects of the computations on other elements
   If order of application of f to elements in list is commutative, we can
    reorder or parallelize execution
   This is the insight behind MapReduce
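A small Python sketch of this point, assuming a pure function `square` (illustrative) and a thread pool from the standard library. The threads are used only to show that the calls are independent; this is not a claim about speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def square(n):
    """A pure function: the result depends only on the argument."""
    return n * n


numbers = list(range(10))

# Sequential map, one element at a time, left to right.
sequential = [square(n) for n in numbers]

# The same map with the calls spread over several worker threads.
# Executor.map still returns results in input order.
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel = list(pool.map(square, numbers))

print(parallel == sequential)
```

Because `square` has no side effects, the order in which the workers actually run the calls cannot change the result.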
Motivation: Large Scale Data Processing

   Want to process lots of data ( > 1 TB)
   Want to parallelize across hundreds/thousands of CPUs
   … Want to make this easy
    • Hide the details of parallelism, machine management, fault
      tolerance, etc.
Sample Applications
   Distributed Grep
   Count of URL Access Frequency
   Reverse Web-Link Graph
   Inverted Index
   Distributed Sort
MapReduce

   Automatic parallelization & distribution
   Fault-tolerant
   Provides status and monitoring tools
   Clean abstraction for programmers
Programming Model

   Borrows from functional programming
   Users implement interface of two functions:

    • map (in_key, in_value) ->
       (out_key, intermediate_value) list

    • reduce (out_key, intermediate_value list) ->
       out_value list
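The two-function interface can be sketched with a tiny single-process driver. This is a teaching sketch, not the MapReduce library: `run_mapreduce`, `map_fn`, and `reduce_fn` are hypothetical names, and everything runs sequentially in memory.

```python
from collections import defaultdict


def run_mapreduce(inputs, map_fn, reduce_fn):
    """Sequential stand-in: map every record, group by key, reduce each group."""
    intermediate = defaultdict(list)
    for in_key, in_value in inputs:
        for out_key, value in map_fn(in_key, in_value):
            intermediate[out_key].append(value)
    return {key: reduce_fn(key, values)
            for key, values in sorted(intermediate.items())}


def map_fn(filename, line):
    # (in_key, in_value) -> list of (out_key, intermediate_value)
    return [(filename, len(line))]


def reduce_fn(filename, lengths):
    # (out_key, intermediate_value list) -> out_value list
    return [sum(lengths)]


inputs = [("a.txt", "hello"), ("b.txt", "hi"), ("a.txt", "world!")]
result = run_mapreduce(inputs, map_fn, reduce_fn)
print(result)
```

The example job totals the characters per file; only `map_fn` and `reduce_fn` would change for a different job.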
Map

   Records from the data source (lines out of files, rows of a
    database, etc) are fed into the map function as key*value pairs:
    e.g., (filename, line)
   map() produces one or more intermediate values along with an
    output key from the input
   Buffers intermediate values in memory before periodically
    writing to local disk
   Writes are split into R regions based on intermediate key value
    (e.g., hash(key) mod R)
     • Locations of regions communicated back to master who informs
       reduce tasks of all appropriate disk locations
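The hash(key) mod R split can be sketched as below. The choice of `zlib.crc32` is an assumption made for this example: Python's built-in `hash()` for strings is salted per process, so a stable hash is needed for every mapper to agree on a key's region. `partition`, `R = 4`, and the sample pairs are illustrative.

```python
import zlib


def partition(key, num_regions):
    """Map an intermediate key to one of num_regions regions."""
    return zlib.crc32(key.encode("utf-8")) % num_regions


R = 4
pairs = [("apple", 1), ("banana", 1), ("apple", 1), ("cherry", 1)]

# Each region would become one file on the mapper's local disk.
regions = {r: [] for r in range(R)}
for key, value in pairs:
    regions[partition(key, R)].append((key, value))

for r, contents in regions.items():
    print(r, contents)
```

All pairs with the same key land in the same region, so a single reduce task sees every value for that key.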
Reduce

   After the map phase is over, all the intermediate values for a
    given output key are combined together into a list
     • RPC over GFS to gather all the keys for a given region
     • Sort all keys since the same key can in general come from
       multiple map processes
   reduce() combines those intermediate values into one or more
    final values for that same output key
   Optional combine() phase as an optimization
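The gather-and-sort step can be sketched with the standard library: sort the pairs collected from several map tasks by key, then group adjacent pairs. `group_by_key` and the two sample lists are illustrative; real reduce workers fetch this data over the network.

```python
from itertools import groupby
from operator import itemgetter


def group_by_key(pairs):
    """Sort pairs by key, then collect the values for each key into a list."""
    ordered = sorted(pairs, key=itemgetter(0))
    return [(key, [value for _, value in group])
            for key, group in groupby(ordered, key=itemgetter(0))]


# Output of two different map tasks for the same region.
from_map_1 = [("b", 1), ("a", 1)]
from_map_2 = [("a", 1), ("c", 1), ("b", 1)]

grouped = group_by_key(from_map_1 + from_map_2)
print(grouped)
```

Sorting is what brings equal keys from different map tasks next to each other, so that each key reaches reduce() exactly once with its full value list.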
[Figure: MapReduce data flow]
Data store 1 ... Data store n: input key*value pairs are fed to one map per data store
Each map emits (key 1, values ...), (key 2, values ...), (key 3, values ...)
== Barrier == : Aggregates intermediate values by output key
Key 1, key 2, and key 3 intermediate values each go to their own reduce
The reduces emit final key 1 values, final key 2 values, and final key 3 values
Parallelism

   map() functions run in parallel, creating different intermediate
    values from different input data sets
   reduce() functions also run in parallel, each working on a
    different output key
   All values are processed independently
   Bottleneck: reduce phase cannot start until map phase
    completes
Example: Count word occurrences
map(String input_key, String input_value):
  // input_key: document name
  // input_value: document contents
  for each word w in input_value:
    EmitIntermediate(w, "1");


reduce(String output_key, Iterator intermediate_values):
  // output_key: a word
  // intermediate_values: a list of counts
  int result = 0;
  for each v in intermediate_values:
    result += ParseInt(v);
  Emit(AsString(result));
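A runnable Python sketch of the word count, assuming whitespace-separated words and an in-memory dictionary in place of the distributed shuffle. `map_words`, `reduce_counts`, and `word_count` are illustrative names; counts stay strings to mirror the pseudo-code.

```python
from collections import defaultdict


def map_words(doc_name, contents):
    """Emit (word, "1") for every word in the document."""
    for word in contents.split():
        yield (word, "1")


def reduce_counts(word, counts):
    """Add up the string counts for one word."""
    return str(sum(int(c) for c in counts))


def word_count(documents):
    # Stand-in for the shuffle: collect intermediate values per word.
    intermediate = defaultdict(list)
    for name, contents in documents.items():
        for word, count in map_words(name, contents):
            intermediate[word].append(count)
    return {word: reduce_counts(word, counts)
            for word, counts in intermediate.items()}


docs = {"d1": "the cat sat", "d2": "the cat ran"}
counts = word_count(docs)
print(counts)
```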
Example vs. Actual Source Code

   Example is written in pseudo-code
   Actual implementation is in C++, using a MapReduce library
   Bindings for Python and Java exist via interfaces
   True code is somewhat more involved (defines how the input
    key/values are divided up and accessed, etc.)
Implementation
Locality

   Master program divides up tasks based on location of data:
    tries to have map() tasks on same machine as physical file data
    • Failing that, on the same switch where bandwidth is relatively
      plentiful
    • Datacenter communications architecture?
   map() task inputs are divided into 64 MB blocks: same size as
    Google File System chunks
Fault Tolerance

   Master detects worker failures
    • Re-executes completed & in-progress map() tasks
    • Re-executes in-progress reduce() tasks
    • Importance of deterministic operations
   Data written to temporary files by both map() and reduce()
    • Upon successful completion, map() tells master of file names
        Master ignores if already heard from another map on same task
    • Upon successful completion, reduce() atomically renames file
   Master notices particular input key/values cause crashes in
    map(), and skips those values on re-execution.
    • Effect: Can work around bugs in third-party libraries
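The record-skipping idea can be sketched in a heavily simplified, single-process form. Everything here is an assumption for illustration: `run_map_task`, `run_with_skipping`, and the retry limit are hypothetical, and the real master tracks failures reported by separate worker processes.

```python
def run_map_task(records, map_fn, skip):
    """Run map_fn over records; report the index of a crashing record."""
    output = []
    for index, record in enumerate(records):
        if index in skip:
            continue  # this record crashed before, so leave it out
        try:
            output.extend(map_fn(record))
        except Exception:
            return None, index
    return output, None


def run_with_skipping(records, map_fn, max_attempts=5):
    """Re-execute the task, skipping records that caused earlier crashes."""
    skip = set()
    for _ in range(max_attempts):
        output, bad_index = run_map_task(records, map_fn, skip)
        if bad_index is None:
            return output, skip
        skip.add(bad_index)
    raise RuntimeError("too many failing records")


def parse_int(record):
    return [int(record)]  # raises ValueError on malformed input


output, skipped = run_with_skipping(["1", "oops", "3"], parse_int)
print(output, skipped)
```

Re-running the whole task from scratch is only safe because the map function is deterministic: the records that succeeded before produce the same output again.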
Optimizations

   No reduce can start until map is complete:
     • A single slow disk controller can rate-limit the whole process
   Master redundantly executes “slow-moving” map tasks; uses
    results of first copy to finish




    Why is it safe to redundantly execute map tasks? Wouldn’t this mess up
    the total computation?
Optimizations

   “Combiner” functions can run on same machine as a mapper
   Causes a mini-reduce phase to occur before the real reduce
    phase, to save bandwidth
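A sketch of a combiner for the word-count job, assuming the combine step is the same addition the reducer performs. `map_words` and `combine` are illustrative names; the input sentence is made up.

```python
from collections import Counter


def map_words(contents):
    """Emit one (word, 1) pair per word."""
    return [(word, 1) for word in contents.split()]


def combine(pairs):
    """Mini-reduce on the mapper's machine: pre-add counts per word."""
    local = Counter()
    for word, count in pairs:
        local[word] += count
    return sorted(local.items())


emitted = map_words("to be or not to be")
combined = combine(emitted)
print(len(emitted), "pairs before combining")
print(len(combined), "pairs after combining:", combined)
```

This works because addition is associative and commutative, so summing partial counts locally and summing them again in reduce() gives the same total while sending fewer pairs over the network.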
Performance Evaluation
   1800 Machines
    • Gigabit Ethernet Switches
    • Two level tree hierarchy, 100-200 Gbps at root
    • 4 GB RAM, 2 GHz dual Intel processors
    • Two 160 GB drives
   Grep: 10^10 100-byte records (1 TB)
    • Search for relatively rare 3-character sequence
   Sort: 10^10 100-byte records (1 TB)
Grep Data Transfer Rate
Sort: Normal Execution
Sort: No Backup Tasks
MapReduce Conclusions

   MapReduce has proven to be a useful abstraction
   Greatly simplifies large-scale computations at Google
   Functional programming paradigm can be applied to large-scale
    applications
   Focus on problem, let library deal w/ messy details

More Related Content

What's hot

A complete introduction on matlab and matlab's projects
A complete introduction on matlab and matlab's projectsA complete introduction on matlab and matlab's projects
A complete introduction on matlab and matlab's projectsMukesh Kumar
 
Haskell for data science
Haskell for data scienceHaskell for data science
Haskell for data scienceJohn Cant
 
N-Queens Combinatorial Problem - Polyglot FP for fun and profit - Haskell and...
N-Queens Combinatorial Problem - Polyglot FP for fun and profit - Haskell and...N-Queens Combinatorial Problem - Polyglot FP for fun and profit - Haskell and...
N-Queens Combinatorial Problem - Polyglot FP for fun and profit - Haskell and...Philip Schwarz
 
Quicksort - a whistle-stop tour of the algorithm in five languages and four p...
Quicksort - a whistle-stop tour of the algorithm in five languages and four p...Quicksort - a whistle-stop tour of the algorithm in five languages and four p...
Quicksort - a whistle-stop tour of the algorithm in five languages and four p...Philip Schwarz
 
Introduction to Monads in Scala (1)
Introduction to Monads in Scala (1)Introduction to Monads in Scala (1)
Introduction to Monads in Scala (1)stasimus
 
Functional programming with haskell
Functional programming with haskellFunctional programming with haskell
Functional programming with haskellfaradjpour
 
R Workshop for Beginners
R Workshop for BeginnersR Workshop for Beginners
R Workshop for BeginnersMetamarkets
 
18. Java associative arrays
18. Java associative arrays18. Java associative arrays
18. Java associative arraysIntro C# Book
 

What's hot (20)

A complete introduction on matlab and matlab's projects
A complete introduction on matlab and matlab's projectsA complete introduction on matlab and matlab's projects
A complete introduction on matlab and matlab's projects
 
Language R
Language RLanguage R
Language R
 
Haskell for data science
Haskell for data scienceHaskell for data science
Haskell for data science
 
Functional Programming
Functional ProgrammingFunctional Programming
Functional Programming
 
Programming in R
Programming in RProgramming in R
Programming in R
 
01. haskell introduction
01. haskell introduction01. haskell introduction
01. haskell introduction
 
R Basics
R BasicsR Basics
R Basics
 
N-Queens Combinatorial Problem - Polyglot FP for fun and profit - Haskell and...
N-Queens Combinatorial Problem - Polyglot FP for fun and profit - Haskell and...N-Queens Combinatorial Problem - Polyglot FP for fun and profit - Haskell and...
N-Queens Combinatorial Problem - Polyglot FP for fun and profit - Haskell and...
 
Quicksort - a whistle-stop tour of the algorithm in five languages and four p...
Quicksort - a whistle-stop tour of the algorithm in five languages and four p...Quicksort - a whistle-stop tour of the algorithm in five languages and four p...
Quicksort - a whistle-stop tour of the algorithm in five languages and four p...
 
R Language Introduction
R Language IntroductionR Language Introduction
R Language Introduction
 
Sql server lab_2
Sql server lab_2Sql server lab_2
Sql server lab_2
 
Introduction to Monads in Scala (1)
Introduction to Monads in Scala (1)Introduction to Monads in Scala (1)
Introduction to Monads in Scala (1)
 
BasicGraphsWithR
BasicGraphsWithRBasicGraphsWithR
BasicGraphsWithR
 
Data import-cheatsheet
Data import-cheatsheetData import-cheatsheet
Data import-cheatsheet
 
Functional programming with haskell
Functional programming with haskellFunctional programming with haskell
Functional programming with haskell
 
What is matlab
What is matlabWhat is matlab
What is matlab
 
R Workshop for Beginners
R Workshop for BeginnersR Workshop for Beginners
R Workshop for Beginners
 
18. Java associative arrays
18. Java associative arrays18. Java associative arrays
18. Java associative arrays
 
Lecture notesmap
Lecture notesmapLecture notesmap
Lecture notesmap
 
Scala collections
Scala collectionsScala collections
Scala collections
 

Similar to Map Reduce

Map reduce (from Google)
Map reduce (from Google)Map reduce (from Google)
Map reduce (from Google)Sri Prasanna
 
Mapreduce: Theory and implementation
Mapreduce: Theory and implementationMapreduce: Theory and implementation
Mapreduce: Theory and implementationSri Prasanna
 
Distributed Computing Seminar - Lecture 2: MapReduce Theory and Implementation
Distributed Computing Seminar - Lecture 2: MapReduce Theory and ImplementationDistributed Computing Seminar - Lecture 2: MapReduce Theory and Implementation
Distributed Computing Seminar - Lecture 2: MapReduce Theory and Implementationtugrulh
 
Big data shim
Big data shimBig data shim
Big data shimtistrue
 
Lecture 3: Data-Intensive Computing for Text Analysis (Fall 2011)
Lecture 3: Data-Intensive Computing for Text Analysis (Fall 2011)Lecture 3: Data-Intensive Computing for Text Analysis (Fall 2011)
Lecture 3: Data-Intensive Computing for Text Analysis (Fall 2011)Matthew Lease
 
R Cheat Sheet for Data Analysts and Statisticians.pdf
R Cheat Sheet for Data Analysts and Statisticians.pdfR Cheat Sheet for Data Analysts and Statisticians.pdf
R Cheat Sheet for Data Analysts and Statisticians.pdfTimothy McBush Hiele
 
Parallel Evaluation of Multi-Semi-Joins
Parallel Evaluation of Multi-Semi-JoinsParallel Evaluation of Multi-Semi-Joins
Parallel Evaluation of Multi-Semi-JoinsJonny Daenen
 
Five Languages in a Moment
Five Languages in a MomentFive Languages in a Moment
Five Languages in a MomentSergio Gil
 
โปรแกรมย่อยและฟังชั่นมาตรฐาน ม.6 1
โปรแกรมย่อยและฟังชั่นมาตรฐาน ม.6 1โปรแกรมย่อยและฟังชั่นมาตรฐาน ม.6 1
โปรแกรมย่อยและฟังชั่นมาตรฐาน ม.6 1Little Tukta Lita
 
Lecture 2: Data-Intensive Computing for Text Analysis (Fall 2011)
Lecture 2: Data-Intensive Computing for Text Analysis (Fall 2011)Lecture 2: Data-Intensive Computing for Text Analysis (Fall 2011)
Lecture 2: Data-Intensive Computing for Text Analysis (Fall 2011)Matthew Lease
 
Rcommands-for those who interested in R.
Rcommands-for those who interested in R.Rcommands-for those who interested in R.
Rcommands-for those who interested in R.Dr. Volkan OBAN
 
Morel, a Functional Query Language
Morel, a Functional Query LanguageMorel, a Functional Query Language
Morel, a Functional Query LanguageJulian Hyde
 
A Proposition for Business Process Modeling
A Proposition for Business Process ModelingA Proposition for Business Process Modeling
A Proposition for Business Process ModelingAng Chen
 
EX-6-Implement Matrix Multiplication with Hadoop Map Reduce.pptx
EX-6-Implement Matrix Multiplication with Hadoop Map Reduce.pptxEX-6-Implement Matrix Multiplication with Hadoop Map Reduce.pptx
EX-6-Implement Matrix Multiplication with Hadoop Map Reduce.pptxvishal choudhary
 
An important part of electrical engineering is PCB design. One impor.pdf
An important part of electrical engineering is PCB design. One impor.pdfAn important part of electrical engineering is PCB design. One impor.pdf
An important part of electrical engineering is PCB design. One impor.pdfARORACOCKERY2111
 
Is Haskell an acceptable Perl?
Is Haskell an acceptable Perl?Is Haskell an acceptable Perl?
Is Haskell an acceptable Perl?osfameron
 

Similar to Map Reduce (20)

Map reduce (from Google)
Map reduce (from Google)Map reduce (from Google)
Map reduce (from Google)
 
Mapreduce: Theory and implementation
Mapreduce: Theory and implementationMapreduce: Theory and implementation
Mapreduce: Theory and implementation
 
Lec2 Mapred
Lec2 MapredLec2 Mapred
Lec2 Mapred
 
Distributed Computing Seminar - Lecture 2: MapReduce Theory and Implementation
Distributed Computing Seminar - Lecture 2: MapReduce Theory and ImplementationDistributed Computing Seminar - Lecture 2: MapReduce Theory and Implementation
Distributed Computing Seminar - Lecture 2: MapReduce Theory and Implementation
 
Big data shim
Big data shimBig data shim
Big data shim
 
Lecture 3: Data-Intensive Computing for Text Analysis (Fall 2011)
Lecture 3: Data-Intensive Computing for Text Analysis (Fall 2011)Lecture 3: Data-Intensive Computing for Text Analysis (Fall 2011)
Lecture 3: Data-Intensive Computing for Text Analysis (Fall 2011)
 
R Cheat Sheet for Data Analysts and Statisticians.pdf
R Cheat Sheet for Data Analysts and Statisticians.pdfR Cheat Sheet for Data Analysts and Statisticians.pdf
R Cheat Sheet for Data Analysts and Statisticians.pdf
 
Parallel Evaluation of Multi-Semi-Joins
Parallel Evaluation of Multi-Semi-JoinsParallel Evaluation of Multi-Semi-Joins
Parallel Evaluation of Multi-Semi-Joins
 
purrr.pdf
purrr.pdfpurrr.pdf
purrr.pdf
 
Five Languages in a Moment
Five Languages in a MomentFive Languages in a Moment
Five Languages in a Moment
 
โปรแกรมย่อยและฟังชั่นมาตรฐาน ม.6 1
โปรแกรมย่อยและฟังชั่นมาตรฐาน ม.6 1โปรแกรมย่อยและฟังชั่นมาตรฐาน ม.6 1
โปรแกรมย่อยและฟังชั่นมาตรฐาน ม.6 1
 
Lecture 2: Data-Intensive Computing for Text Analysis (Fall 2011)
Lecture 2: Data-Intensive Computing for Text Analysis (Fall 2011)Lecture 2: Data-Intensive Computing for Text Analysis (Fall 2011)
Lecture 2: Data-Intensive Computing for Text Analysis (Fall 2011)
 
Rcommands-for those who interested in R.
Rcommands-for those who interested in R.Rcommands-for those who interested in R.
Rcommands-for those who interested in R.
 
Morel, a Functional Query Language
Morel, a Functional Query LanguageMorel, a Functional Query Language
Morel, a Functional Query Language
 
Hw09 Hadoop + Clojure
Hw09   Hadoop + ClojureHw09   Hadoop + Clojure
Hw09 Hadoop + Clojure
 
A Proposition for Business Process Modeling
A Proposition for Business Process ModelingA Proposition for Business Process Modeling
A Proposition for Business Process Modeling
 
EX-6-Implement Matrix Multiplication with Hadoop Map Reduce.pptx
EX-6-Implement Matrix Multiplication with Hadoop Map Reduce.pptxEX-6-Implement Matrix Multiplication with Hadoop Map Reduce.pptx
EX-6-Implement Matrix Multiplication with Hadoop Map Reduce.pptx
 
An important part of electrical engineering is PCB design. One impor.pdf
An important part of electrical engineering is PCB design. One impor.pdfAn important part of electrical engineering is PCB design. One impor.pdf
An important part of electrical engineering is PCB design. One impor.pdf
 
Is Haskell an acceptable Perl?
Is Haskell an acceptable Perl?Is Haskell an acceptable Perl?
Is Haskell an acceptable Perl?
 
Hadoop + Clojure
Hadoop + ClojureHadoop + Clojure
Hadoop + Clojure
 

More from Sri Prasanna

More from Sri Prasanna (20)

Qr codes para tech radar
Qr codes para tech radarQr codes para tech radar
Qr codes para tech radar
 
Qr codes para tech radar 2
Qr codes para tech radar 2Qr codes para tech radar 2
Qr codes para tech radar 2
 
Test
TestTest
Test
 
Test
TestTest
Test
 
assds
assdsassds
assds
 
assds
assdsassds
assds
 
asdsa
asdsaasdsa
asdsa
 
dsd
dsddsd
dsd
 
About stacks
About stacksAbout stacks
About stacks
 
About Stacks
About  StacksAbout  Stacks
About Stacks
 
About Stacks
About  StacksAbout  Stacks
About Stacks
 
About Stacks
About  StacksAbout  Stacks
About Stacks
 
About Stacks
About  StacksAbout  Stacks
About Stacks
 
About Stacks
About  StacksAbout  Stacks
About Stacks
 
About Stacks
About StacksAbout Stacks
About Stacks
 
About Stacks
About StacksAbout Stacks
About Stacks
 
Network and distributed systems
Network and distributed systemsNetwork and distributed systems
Network and distributed systems
 
Introduction & Parellelization on large scale clusters
Introduction & Parellelization on large scale clustersIntroduction & Parellelization on large scale clusters
Introduction & Parellelization on large scale clusters
 
Other distributed systems
Other distributed systemsOther distributed systems
Other distributed systems
 
Distributed file systems
Distributed file systemsDistributed file systems
Distributed file systems
 

Recently uploaded

ANTI PARKISON DRUGS.pptx
ANTI         PARKISON          DRUGS.pptxANTI         PARKISON          DRUGS.pptx
ANTI PARKISON DRUGS.pptxPoojaSen20
 
Improved Approval Flow in Odoo 17 Studio App
Improved Approval Flow in Odoo 17 Studio AppImproved Approval Flow in Odoo 17 Studio App
Improved Approval Flow in Odoo 17 Studio AppCeline George
 
male presentation...pdf.................
male presentation...pdf.................male presentation...pdf.................
male presentation...pdf.................MirzaAbrarBaig5
 
Basic Civil Engineering notes on Transportation Engineering & Modes of Transport
Basic Civil Engineering notes on Transportation Engineering & Modes of TransportBasic Civil Engineering notes on Transportation Engineering & Modes of Transport
Basic Civil Engineering notes on Transportation Engineering & Modes of TransportDenish Jangid
 
TỔNG HỢP HƠN 100 ĐỀ THI THỬ TỐT NGHIỆP THPT TOÁN 2024 - TỪ CÁC TRƯỜNG, TRƯỜNG...
TỔNG HỢP HƠN 100 ĐỀ THI THỬ TỐT NGHIỆP THPT TOÁN 2024 - TỪ CÁC TRƯỜNG, TRƯỜNG...TỔNG HỢP HƠN 100 ĐỀ THI THỬ TỐT NGHIỆP THPT TOÁN 2024 - TỪ CÁC TRƯỜNG, TRƯỜNG...
TỔNG HỢP HƠN 100 ĐỀ THI THỬ TỐT NGHIỆP THPT TOÁN 2024 - TỪ CÁC TRƯỜNG, TRƯỜNG...Nguyen Thanh Tu Collection
 
demyelinated disorder: multiple sclerosis.pptx
demyelinated disorder: multiple sclerosis.pptxdemyelinated disorder: multiple sclerosis.pptx
demyelinated disorder: multiple sclerosis.pptxMohamed Rizk Khodair
 
Spring gala 2024 photo slideshow - Celebrating School-Community Partnerships
Spring gala 2024 photo slideshow - Celebrating School-Community PartnershipsSpring gala 2024 photo slideshow - Celebrating School-Community Partnerships
Spring gala 2024 photo slideshow - Celebrating School-Community Partnershipsexpandedwebsite
 
ĐỀ THAM KHẢO KÌ THI TUYỂN SINH VÀO LỚP 10 MÔN TIẾNG ANH FORM 50 CÂU TRẮC NGHI...
ĐỀ THAM KHẢO KÌ THI TUYỂN SINH VÀO LỚP 10 MÔN TIẾNG ANH FORM 50 CÂU TRẮC NGHI...ĐỀ THAM KHẢO KÌ THI TUYỂN SINH VÀO LỚP 10 MÔN TIẾNG ANH FORM 50 CÂU TRẮC NGHI...
ĐỀ THAM KHẢO KÌ THI TUYỂN SINH VÀO LỚP 10 MÔN TIẾNG ANH FORM 50 CÂU TRẮC NGHI...Nguyen Thanh Tu Collection
 
How to Manage Website in Odoo 17 Studio App.pptx
How to Manage Website in Odoo 17 Studio App.pptxHow to Manage Website in Odoo 17 Studio App.pptx
How to Manage Website in Odoo 17 Studio App.pptxCeline George
 
Analyzing and resolving a communication crisis in Dhaka textiles LTD.pptx
Analyzing and resolving a communication crisis in Dhaka textiles LTD.pptxAnalyzing and resolving a communication crisis in Dhaka textiles LTD.pptx
Analyzing and resolving a communication crisis in Dhaka textiles LTD.pptxLimon Prince
 
8 Tips for Effective Working Capital Management
8 Tips for Effective Working Capital Management8 Tips for Effective Working Capital Management
8 Tips for Effective Working Capital ManagementMBA Assignment Experts
 
An overview of the various scriptures in Hinduism
An overview of the various scriptures in HinduismAn overview of the various scriptures in Hinduism
An overview of the various scriptures in HinduismDabee Kamal
 
MOOD STABLIZERS DRUGS.pptx
MOOD     STABLIZERS           DRUGS.pptxMOOD     STABLIZERS           DRUGS.pptx
MOOD STABLIZERS DRUGS.pptxPoojaSen20
 
24 ĐỀ THAM KHẢO KÌ THI TUYỂN SINH VÀO LỚP 10 MÔN TIẾNG ANH SỞ GIÁO DỤC HẢI DƯ...
24 ĐỀ THAM KHẢO KÌ THI TUYỂN SINH VÀO LỚP 10 MÔN TIẾNG ANH SỞ GIÁO DỤC HẢI DƯ...24 ĐỀ THAM KHẢO KÌ THI TUYỂN SINH VÀO LỚP 10 MÔN TIẾNG ANH SỞ GIÁO DỤC HẢI DƯ...
24 ĐỀ THAM KHẢO KÌ THI TUYỂN SINH VÀO LỚP 10 MÔN TIẾNG ANH SỞ GIÁO DỤC HẢI DƯ...Nguyen Thanh Tu Collection
 
Major project report on Tata Motors and its marketing strategies
Major project report on Tata Motors and its marketing strategiesMajor project report on Tata Motors and its marketing strategies
Major project report on Tata Motors and its marketing strategiesAmanpreetKaur157993
 
Scopus Indexed Journals 2024 - ISCOPUS Publications
Scopus Indexed Journals 2024 - ISCOPUS PublicationsScopus Indexed Journals 2024 - ISCOPUS Publications
Scopus Indexed Journals 2024 - ISCOPUS PublicationsISCOPE Publication
 
會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文
會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文
會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文中 央社
 
An Overview of the Odoo 17 Knowledge App
An Overview of the Odoo 17 Knowledge AppAn Overview of the Odoo 17 Knowledge App
An Overview of the Odoo 17 Knowledge AppCeline George
 
Transparency, Recognition and the role of eSealing - Ildiko Mazar and Koen No...
Transparency, Recognition and the role of eSealing - Ildiko Mazar and Koen No...Transparency, Recognition and the role of eSealing - Ildiko Mazar and Koen No...
Transparency, Recognition and the role of eSealing - Ildiko Mazar and Koen No...EADTU
 

Recently uploaded (20)

ANTI PARKISON DRUGS.pptx
ANTI         PARKISON          DRUGS.pptxANTI         PARKISON          DRUGS.pptx
ANTI PARKISON DRUGS.pptx
 
Improved Approval Flow in Odoo 17 Studio App
Improved Approval Flow in Odoo 17 Studio AppImproved Approval Flow in Odoo 17 Studio App
Improved Approval Flow in Odoo 17 Studio App
 
male presentation...pdf.................
male presentation...pdf.................male presentation...pdf.................
male presentation...pdf.................
 
Basic Civil Engineering notes on Transportation Engineering & Modes of Transport
Basic Civil Engineering notes on Transportation Engineering & Modes of TransportBasic Civil Engineering notes on Transportation Engineering & Modes of Transport
Basic Civil Engineering notes on Transportation Engineering & Modes of Transport
 
TỔNG HỢP HƠN 100 ĐỀ THI THỬ TỐT NGHIỆP THPT TOÁN 2024 - TỪ CÁC TRƯỜNG, TRƯỜNG...
TỔNG HỢP HƠN 100 ĐỀ THI THỬ TỐT NGHIỆP THPT TOÁN 2024 - TỪ CÁC TRƯỜNG, TRƯỜNG...TỔNG HỢP HƠN 100 ĐỀ THI THỬ TỐT NGHIỆP THPT TOÁN 2024 - TỪ CÁC TRƯỜNG, TRƯỜNG...
TỔNG HỢP HƠN 100 ĐỀ THI THỬ TỐT NGHIỆP THPT TOÁN 2024 - TỪ CÁC TRƯỜNG, TRƯỜNG...
 
demyelinated disorder: multiple sclerosis.pptx
demyelinated disorder: multiple sclerosis.pptxdemyelinated disorder: multiple sclerosis.pptx
demyelinated disorder: multiple sclerosis.pptx
 
Spring gala 2024 photo slideshow - Celebrating School-Community Partnerships
Spring gala 2024 photo slideshow - Celebrating School-Community PartnershipsSpring gala 2024 photo slideshow - Celebrating School-Community Partnerships
Spring gala 2024 photo slideshow - Celebrating School-Community Partnerships
 
ĐỀ THAM KHẢO KÌ THI TUYỂN SINH VÀO LỚP 10 MÔN TIẾNG ANH FORM 50 CÂU TRẮC NGHI...
ĐỀ THAM KHẢO KÌ THI TUYỂN SINH VÀO LỚP 10 MÔN TIẾNG ANH FORM 50 CÂU TRẮC NGHI...ĐỀ THAM KHẢO KÌ THI TUYỂN SINH VÀO LỚP 10 MÔN TIẾNG ANH FORM 50 CÂU TRẮC NGHI...
ĐỀ THAM KHẢO KÌ THI TUYỂN SINH VÀO LỚP 10 MÔN TIẾNG ANH FORM 50 CÂU TRẮC NGHI...
 
How to Manage Website in Odoo 17 Studio App.pptx
How to Manage Website in Odoo 17 Studio App.pptxHow to Manage Website in Odoo 17 Studio App.pptx
How to Manage Website in Odoo 17 Studio App.pptx
 
Analyzing and resolving a communication crisis in Dhaka textiles LTD.pptx
Analyzing and resolving a communication crisis in Dhaka textiles LTD.pptxAnalyzing and resolving a communication crisis in Dhaka textiles LTD.pptx
Analyzing and resolving a communication crisis in Dhaka textiles LTD.pptx
 
8 Tips for Effective Working Capital Management
8 Tips for Effective Working Capital Management8 Tips for Effective Working Capital Management
8 Tips for Effective Working Capital Management
 
An overview of the various scriptures in Hinduism
An overview of the various scriptures in HinduismAn overview of the various scriptures in Hinduism
An overview of the various scriptures in Hinduism
 
MOOD STABLIZERS DRUGS.pptx
MOOD     STABLIZERS           DRUGS.pptxMOOD     STABLIZERS           DRUGS.pptx
MOOD STABLIZERS DRUGS.pptx
 
24 ĐỀ THAM KHẢO KÌ THI TUYỂN SINH VÀO LỚP 10 MÔN TIẾNG ANH SỞ GIÁO DỤC HẢI DƯ...
24 ĐỀ THAM KHẢO KÌ THI TUYỂN SINH VÀO LỚP 10 MÔN TIẾNG ANH SỞ GIÁO DỤC HẢI DƯ...24 ĐỀ THAM KHẢO KÌ THI TUYỂN SINH VÀO LỚP 10 MÔN TIẾNG ANH SỞ GIÁO DỤC HẢI DƯ...
24 ĐỀ THAM KHẢO KÌ THI TUYỂN SINH VÀO LỚP 10 MÔN TIẾNG ANH SỞ GIÁO DỤC HẢI DƯ...
 
Major project report on Tata Motors and its marketing strategies
Major project report on Tata Motors and its marketing strategiesMajor project report on Tata Motors and its marketing strategies
Major project report on Tata Motors and its marketing strategies
 
Scopus Indexed Journals 2024 - ISCOPUS Publications
Scopus Indexed Journals 2024 - ISCOPUS PublicationsScopus Indexed Journals 2024 - ISCOPUS Publications
Scopus Indexed Journals 2024 - ISCOPUS Publications
 
會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文
會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文
會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文
 
An Overview of the Odoo 17 Knowledge App
An Overview of the Odoo 17 Knowledge AppAn Overview of the Odoo 17 Knowledge App
An Overview of the Odoo 17 Knowledge App
 
Transparency, Recognition and the role of eSealing - Ildiko Mazar and Koen No...
Transparency, Recognition and the role of eSealing - Ildiko Mazar and Koen No...Transparency, Recognition and the role of eSealing - Ildiko Mazar and Koen No...
Transparency, Recognition and the role of eSealing - Ildiko Mazar and Koen No...
 
Including Mental Health Support in Project Delivery, 14 May.pdf
Including Mental Health Support in Project Delivery, 14 May.pdfIncluding Mental Health Support in Project Delivery, 14 May.pdf
Including Mental Health Support in Project Delivery, 14 May.pdf
 

Map Reduce

  • 1. Map/Reduce Based on original presentation by Christophe Bisciglia, Aaron Kimball, and Sierra Michels-Slettvet Except as otherwise noted, the content of this presentation is licensed under the Creative Commons Attribution 2.5 License.
  • 2. Functional Programming Review  Functional operations do not modify data structures: They always create new ones  Original data still exists in unmodified form  Data flows are implicit in program design  Order of operations does not matter
  • 3. Functional Programming Review
      fun foo(l: int list) =
        sum(l) + mul(l) + length(l)
      The order in which sum(), mul(), and length() are evaluated does not matter, because none of them modifies l.
  • 4. Functional Updates Do Not Modify Structures
      fun append(x, lst) =
        let val lst' = rev lst in
          rev (x :: lst')
        end
      The append() function above reverses a list, adds a new element to the front, and reverses the result again, which has the effect of appending the item.
      But it never modifies lst!
  • 5. Functions Can Be Used As Arguments
      fun DoDouble(f, x) = f (f x)
      It does not matter what f does to its argument; DoDouble() will do it twice.
      What is the type of this function?
  • 6. Map
      fun map f [] = []
        | map f (x::xs) = (f x) :: (map f xs)
      Creates a new list by applying f to each element of the input list; returns the output in order.
      [Diagram: f applied independently to each element of the list]
  • 7. Fold
      fold f x0 lst: ('a*'b->'b)->'b->('a list)->'b
      Moves across a list, applying f to each element plus an accumulator. f returns the next accumulator value, which is combined with the next element of the list.
      [Diagram: the initial accumulator value passes through one application of f per list element; the last result is returned]
  • 8. fold left vs. fold right
      - Order of list elements can be significant
      - Fold left moves left-to-right across the list
      - Fold right moves right-to-left
      SML Implementation:
      fun foldl f a [] = a
        | foldl f a (x::xs) = foldl f (f(x, a)) xs
      fun foldr f a [] = a
        | foldr f a (x::xs) = f(x, (foldr f a xs))
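The two SML folds above can be mirrored in Python to see why element order can matter. This is only a sketch: the names `foldl`, `foldr`, and `cons` are chosen here for illustration, and the callback keeps the slide's argument order, f(element, accumulator).

```python
def foldl(f, acc, xs):
    """Left fold: visit xs left-to-right, calling f(element, accumulator)."""
    for x in xs:
        acc = f(x, acc)
    return acc


def foldr(f, acc, xs):
    """Right fold: visit xs right-to-left, calling f(element, accumulator)."""
    for x in reversed(xs):
        acc = f(x, acc)
    return acc


def cons(x, acc):
    """Build a new list with x in front of acc; acc itself is not modified."""
    return [x] + acc


left = foldl(cons, [], [1, 2, 3])   # [3, 2, 1]: left fold reverses the list
right = foldr(cons, [], [1, 2, 3])  # [1, 2, 3]: right fold preserves order
```

With a non-commutative function such as `cons`, the two folds give different results; with addition they would agree.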
  • 9. Example
      fun foo(l: int list) =
        sum(l) + mul(l) + length(l)
      How can we implement this?
  • 10. Example (Solved)
      fun foo(l: int list) =
        sum(l) + mul(l) + length(l)
      fun sum(lst) = foldl (fn (x,a)=>x+a) 0 lst
      fun mul(lst) = foldl (fn (x,a)=>x*a) 1 lst
      fun length(lst) = foldl (fn (x,a)=>1+a) 0 lst
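For comparison, the same three folds can be written with Python's `functools.reduce`. This is a sketch, not part of the original deck: the names `total`, `product`, and `count` are illustrative, and `reduce` passes the accumulator first, the reverse of the slide's (x, a) order.

```python
from functools import reduce


def total(lst):
    # Fold with +, starting from 0.
    return reduce(lambda a, x: a + x, lst, 0)


def product(lst):
    # Fold with *, starting from 1.
    return reduce(lambda a, x: a * x, lst, 1)


def count(lst):
    # Ignore the element itself and add 1 per element.
    return reduce(lambda a, _x: a + 1, lst, 0)


def foo(lst):
    # The three folds are independent, so their evaluation order is irrelevant.
    return total(lst) + product(lst) + count(lst)


result = foo([1, 2, 3, 4])  # 10 + 24 + 4 = 38
```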
  • 11. A More Complicated Fold Problem
      - Given a list of numbers, how can we generate a list of partial sums?
      e.g.: [1, 4, 8, 3, 7, 9] -> [0, 1, 5, 13, 16, 23, 32]
  • 12. A More Complicated Fold Problem
      - Given a list of numbers, how can we generate a list of partial sums?
      e.g.: [1, 4, 8, 3, 7, 9] -> [0, 1, 5, 13, 16, 23, 32]
      fun partialsum(lst) = foldl (fn (x, a) => a @ [List.last a + x]) [0] lst
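The partial-sums fold can be sketched in Python as follows. The accumulator is the list built so far, seeded with [0] so that its last element is always the running total; the name `partial_sums` is illustrative.

```python
from functools import reduce


def partial_sums(lst):
    # Each step returns a new list: the old partial sums plus one more entry.
    return reduce(lambda acc, x: acc + [acc[-1] + x], lst, [0])


sums = partial_sums([1, 4, 8, 3, 7, 9])  # [0, 1, 5, 13, 16, 23, 32]
```

Copying the accumulator on every step keeps the fold purely functional but makes it quadratic in the list length; that is acceptable for a teaching example.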
  • 13. A More Complicated Map Problem
      - Given a list of words, can we reverse the letters in each word and reverse the whole list, so it all comes out backwards?
      ["my", "happy", "cat"] -> ["tac", "yppah", "ym"]
  • 14. A More Complicated Map Problem
      - Given a list of words, can we reverse the letters in each word and reverse the whole list, so it all comes out backwards?
      ["my", "happy", "cat"] -> ["tac", "yppah", "ym"]
      fun reverse2(lst) = foldr (fn (x, a) => a @ [reverseword x]) [] lst
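A Python sketch of the same idea: a right fold walks the list from the last word to the first and appends each reversed word to the accumulator. `reverse_word` stands in for the slide's `reverseword` helper, which the deck does not define, so its body here is an assumption.

```python
from functools import reduce


def reverse_word(word):
    # Assumed helper: reverse the letters of a single word.
    return word[::-1]


def reverse2(words):
    # reversed(words) gives the right-to-left traversal of a right fold.
    return reduce(lambda acc, w: acc + [reverse_word(w)], reversed(words), [])


backwards = reverse2(["my", "happy", "cat"])  # ["tac", "yppah", "ym"]
```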
  • 15. map Implementation
      fun map f [] = []
        | map f (x::xs) = (f x) :: (map f xs)
      - This implementation moves left-to-right across the list, mapping elements one at a time
      - … But does it need to?
  • 16. Implicit Parallelism In Map
      - In a purely functional setting, elements of a list being computed by map cannot see the effects of the computations on other elements
      - If the order of application of f to the elements of the list is commutative (the result does not depend on it), we can reorder or parallelize execution
      - This is the insight behind MapReduce
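A small Python sketch of this point, assuming f is a pure function: the calls can be handed to a pool of workers in any order, and the result still matches the sequential map. `parallel_map` and `square` are names invented for this example; `ThreadPoolExecutor.map` returns results in input order regardless of which call finishes first.

```python
from concurrent.futures import ThreadPoolExecutor


def square(x):
    # Pure function: no shared state, so calls cannot observe each other.
    return x * x


def parallel_map(f, xs, workers=4):
    # The pool may run the calls concurrently and in any order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(f, xs))


sequential = list(map(square, range(10)))
parallel = parallel_map(square, range(10))
```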
  • 17. Motivation: Large Scale Data Processing
      - Want to process lots of data (> 1 TB)
      - Want to parallelize across hundreds/thousands of CPUs
      - … Want to make this easy
          - Hide the details of parallelism, machine management, fault tolerance, etc.
  • 18. Sample Applications
      - Distributed Grep
      - Count of URL Access Frequency
      - Reverse Web-Link Graph
      - Inverted Index
      - Distributed Sort
  • 19. MapReduce
      - Automatic parallelization & distribution
      - Fault-tolerant
      - Provides status and monitoring tools
      - Clean abstraction for programmers
  • 20. Programming Model
      - Borrows from functional programming
      - Users implement interface of two functions:
          - map (in_key, in_value) -> (out_key, intermediate_value) list
          - reduce (out_key, intermediate_value list) -> out_value list
  • 21. Map
      - Records from the data source (lines out of files, rows of a database, etc.) are fed into the map function as key*value pairs: e.g., (filename, line)
      - map() produces one or more intermediate values along with an output key from the input
      - Buffers intermediate values in memory before periodically writing to local disk
      - Writes are split into R regions based on intermediate key value (e.g., hash(key) mod R)
          - Locations of regions are communicated back to the master, which informs reduce tasks of all appropriate disk locations
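The hash(key) mod R rule can be sketched in Python. Everything named here is an assumption made for illustration: the deck does not say which hash function is used, so `zlib.crc32` is swapped in because it gives the same value on every run, and `partition` and `split_into_regions` are hypothetical helpers.

```python
import zlib


def partition(key, num_regions):
    # Stable hash of the key, reduced modulo R (num_regions).
    return zlib.crc32(key.encode("utf-8")) % num_regions


def split_into_regions(pairs, num_regions):
    # One bucket per region; every pair with the same key lands in the same bucket.
    regions = [[] for _ in range(num_regions)]
    for key, value in pairs:
        regions[partition(key, num_regions)].append((key, value))
    return regions


pairs = [("cat", 1), ("dog", 1), ("cat", 1), ("emu", 1)]
regions = split_into_regions(pairs, 3)
```

Because the region depends only on the key, one reduce task per region is guaranteed to see every value for the keys it owns.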
  • 22. Reduce
      - After the map phase is over, all the intermediate values for a given output key are combined together into a list
          - RPC over GFS to gather all the keys for a given region
          - Sort all keys since the same key can in general come from multiple map processes
      - reduce() combines those intermediate values into one or more final values for that same output key
      - Optional combine() phase as an optimization
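The gather-sort-group step can be sketched in memory with Python. This is only an illustration: the real system gathers data over the network, and the function name `shuffle` is an assumption, not a name from the MapReduce library.

```python
from itertools import groupby
from operator import itemgetter


def shuffle(map_outputs):
    # Gather the (key, value) pairs produced by every map task.
    merged = [pair for output in map_outputs for pair in output]
    # Sort so that equal keys from different map tasks become adjacent.
    merged.sort(key=itemgetter(0))
    # Collect the values for each key into one list.
    return [(key, [value for _, value in group])
            for key, group in groupby(merged, key=itemgetter(0))]


grouped = shuffle([[("a", 1), ("b", 1)], [("a", 1)]])
# [("a", [1, 1]), ("b", [1])]
```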
  • 23. [Diagram: MapReduce data flow]
      - Input key*value pairs from data store 1 ... data store n are fed to map tasks
      - Each map task emits (key 1, values...), (key 2, values...), (key 3, values...)
      - == Barrier ==: aggregates intermediate values by output key
      - The intermediate values for key 1, key 2, and key 3 each go to a reduce task
      - Each reduce task produces the final values for its key (final key 1 values, final key 2 values, final key 3 values)
  • 24. Parallelism
      - map() functions run in parallel, creating different intermediate values from different input data sets
      - reduce() functions also run in parallel, each working on a different output key
      - All values are processed independently
      - Bottleneck: reduce phase cannot start until map phase completes
  • 25. Example: Count word occurrences
      map(String input_key, String input_value):
        // input_key: document name
        // input_value: document contents
        for each word w in input_value:
          EmitIntermediate(w, "1");
      reduce(String output_key, Iterator intermediate_values):
        // output_key: a word
        // intermediate_values: a list of counts
        int result = 0;
        for each v in intermediate_values:
          result += ParseInt(v);
        Emit(AsString(result));
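The pseudo-code above can be made runnable with a tiny single-process driver. This is a sketch under stated assumptions: `run_mapreduce` is a hypothetical stand-in for the library, `map_fn` and `reduce_fn` mirror the slide's map and reduce, and splitting on whitespace is an assumed definition of "word".

```python
from collections import defaultdict


def map_fn(input_key, input_value):
    # input_key: document name; input_value: document contents
    for word in input_value.split():
        yield (word, "1")


def reduce_fn(output_key, intermediate_values):
    # output_key: a word; intermediate_values: a list of counts as strings
    yield str(sum(int(v) for v in intermediate_values))


def run_mapreduce(inputs, mapper, reducer):
    # Map phase: collect intermediate values under their output key.
    intermediate = defaultdict(list)
    for in_key, in_value in inputs:
        for out_key, value in mapper(in_key, in_value):
            intermediate[out_key].append(value)
    # Reduce phase: runs only after every map call has finished (the barrier).
    return {key: list(reducer(key, values))
            for key, values in sorted(intermediate.items())}


counts = run_mapreduce(
    [("doc1", "the cat sat"), ("doc2", "the cat ran")], map_fn, reduce_fn)
# {"cat": ["2"], "ran": ["1"], "sat": ["1"], "the": ["2"]}
```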
  • 26. Example vs. Actual Source Code
      - Example is written in pseudo-code
      - Actual implementation is in C++, using a MapReduce library
      - Bindings for Python and Java exist via interfaces
      - True code is somewhat more involved (defines how the input key/values are divided up and accessed, etc.)
  • 28. Locality
      - Master program divides up tasks based on location of data: tries to have map() tasks on same machine as physical file data
          - Failing that, on the same switch where bandwidth is relatively plentiful
          - Datacenter communications architecture?
      - map() task inputs are divided into 64 MB blocks: same size as Google File System chunks
  • 29. Fault Tolerance
      - Master detects worker failures
          - Re-executes completed & in-progress map() tasks
          - Re-executes in-progress reduce() tasks
          - Importance of deterministic operations
      - Data written to temporary files by both map() and reduce()
          - Upon successful completion, map() tells master of file names
              - Master ignores if already heard from another map on same task
          - Upon successful completion, reduce() atomically renames file
      - Master notices particular input key/values cause crashes in map(), and skips those values on re-execution.
          - Effect: Can work around bugs in third-party libraries
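The write-to-temporary-file-then-rename pattern can be sketched in Python. This illustrates the general technique on a local filesystem, not the library's actual code: `write_output_atomically` is a hypothetical name, and `os.replace` is atomic only when both paths are on the same filesystem, which is why the temporary file is created next to the final one.

```python
import os
import tempfile


def write_output_atomically(final_path, lines):
    # Create the temporary file in the same directory as the final output.
    directory = os.path.dirname(os.path.abspath(final_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as tmp:
            for line in lines:
                tmp.write(line + "\n")
        # Readers see either no output file or the complete one, never a partial file.
        os.replace(tmp_path, final_path)
    except BaseException:
        # A failed task leaves no half-written output behind.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

If two copies of the same task both finish, the second rename simply replaces the first file with identical contents, provided the task is deterministic.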
  • 30. Optimizations
      - No reduce can start until map is complete:
          - A single slow disk controller can rate-limit the whole process
      - Master redundantly executes "slow-moving" map tasks; uses results of first copy to finish
      Why is it safe to redundantly execute map tasks? Wouldn't this mess up the total computation?
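One way to see why redundant execution can be safe, assuming the map task is deterministic and has no side effects: every copy produces the same output, so it does not matter which copy finishes first. The sketch below uses hypothetical names (`map_task`, `run_with_backup`) and local threads in place of worker machines.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def map_task(lines):
    # Deterministic and side-effect free: same input, same output, every time.
    return [(word, 1) for line in lines for word in line.split()]


def run_with_backup(task, task_input, copies=2):
    # Launch identical copies and take whichever finishes first.
    with ThreadPoolExecutor(max_workers=copies) as pool:
        futures = [pool.submit(task, task_input) for _ in range(copies)]
        done, _pending = wait(futures, return_when=FIRST_COMPLETED)
        return next(iter(done)).result()


first_result = run_with_backup(map_task, ["a b", "a"])
```

If the task wrote to shared state or depended on randomness, the copies could disagree and this shortcut would no longer be valid.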
  • 31. Optimizations
      - "Combiner" functions can run on same machine as a mapper
      - Causes a mini-reduce phase to occur before the real reduce phase, to save bandwidth
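A sketch of what a combiner saves, using word count as the running example. The names `map_words` and `combine` are illustrative; the point is only that pre-summing on the mapper's machine shrinks the number of pairs that must cross the network.

```python
from collections import Counter


def map_words(text):
    # Raw map output: one (word, 1) pair per occurrence.
    return [(word, 1) for word in text.split()]


def combine(pairs):
    # Mini-reduce on the mapper's machine: sum the counts per word locally.
    combined = Counter()
    for word, count in pairs:
        combined[word] += count
    return sorted(combined.items())


raw = map_words("to be or not to be")  # 6 pairs
local = combine(raw)                   # 4 pairs: be=2, not=1, or=1, to=2
```

This works for word count because addition is associative and commutative, so summing partial counts early does not change the final totals.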
  • 32. Performance Evaluation
      - 1800 machines
          - Gigabit Ethernet switches
          - Two-level tree hierarchy, 100-200 Gbps at root
          - 4 GB RAM, 2 GHz dual Intel processors
          - Two 160 GB drives
      - Grep: 10^10 100-byte records (1 TB)
          - Search for relatively rare 3-character sequence
      - Sort: 10^10 100-byte records (1 TB)
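The input size quoted above follows from simple arithmetic, and combining it with the 64 MB block size from the Locality slide gives a rough idea of how many map inputs such a job has. The sketch assumes "64 MB" means 64 MiB; the deck does not state the number of map tasks used in the evaluation.

```python
RECORDS = 10**10
RECORD_BYTES = 100
SPLIT_BYTES = 64 * 2**20  # assumption: 64 MB taken as 64 MiB

total_bytes = RECORDS * RECORD_BYTES            # 10**12 bytes, i.e. 1 TB
num_splits = -(-total_bytes // SPLIT_BYTES)     # ceiling division: 14902
```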
  • 36. MapReduce Conclusions
      - MapReduce has proven to be a useful abstraction
      - Greatly simplifies large-scale computations at Google
      - Functional programming paradigm can be applied to large-scale applications
      - Focus on problem, let library deal w/ messy details