Cloud Computing & MapReduce:
Parallel Processing on a Massive Scale
     Geoff Rothman (rothman@hp.com)
              March 27, 2010
Outline
1. Overview of Cloud Computing
  – Establish a general definition

2. Overview of Google MapReduce
  – Parallel programming with Cloud Computing

3. Debate between MapReduce & Parallel DBMS
  – Is one better than the other or are they
    complementary?
Overview of Cloud Computing
Cloud Computing: What Does It Mean?
• On-demand network access to shared pool of
  configurable computing resources [1]
(cloud diagram omitted) [2]
NIST View of Cloud Computing

• Five characteristics

• Three service models

• Four deployment models
Cloud Computing Characteristics

• On-Demand & Automated

• Broad network access

• Resource Pooling

• Rapid Elasticity

• Measured Service
“SPI Model - as a Service”
• Software as a Service (SaaS):
   – Application system (Salesforce, WebEx)

• Platform as a Service (PaaS):
   – Infrastructure pre-existing; simply code and deploy
     (Google AppEngine, MS Azure, Force.com)

• Infrastructure as a Service (IaaS):
   – Raw infrastructure: servers and storage provided on-demand (Amazon Web Services, GoGrid) [3]
[4] (diagram slide omitted)
[5] (diagram slide omitted)
Cloud Deployment Models
• Private
   – Single tenant, owned and managed by company or service provider
     either on or off-premise; consumers are trusted

• Public
   – Single or multitenant (shared), owned by service provider off-premise;
     consumers are untrusted

• Managed
   – Single or multi-tenant (shared), located in org’s datacenter but
     managed and secured by Service Provider; consumers are trusted or
     untrusted

• Hybrid
   – Combination of public/private offering; “cloud burst”; consumers are
     trusted or untrusted
Why use the Cloud? CFO View

• Operational vs Capital Expenditures

• Better Cash Flow

• Limited Financial Risk

• Better Balance Sheet

• Outsource non-core competencies [7]
Why Use the Cloud? CIO View
• Analytics
• Parallel Batch Processing
• Compute-intensive desktop apps [6]
• Mobile Interactive Apps (GUI for mashups) [6]
• Webserver uptime / redundancy
• Accelerate project rollouts
Overview of Google MapReduce
Cloud Computing & Parallel Batch
Processing: Overview of Map/Reduce
• Developed by Google to perform simple
  computations on massive amounts of data
  (> 1 TB) in a substantially reduced amount of time

• Hides details of
   –   Parallelization
   –   Data distribution
   –   Load balancing
   –   Fault tolerance
MapReduce Programming Model [8]
Input & Output: each a set of key/value pairs

Code two functions: map & reduce

map (in_key, in_value) -> list(out_key, intermediate_value)
• Processes input key/value pair
• Produces set of intermediate pairs

reduce (out_key, list(intermediate_value)) -> list(out_value)
• Combines all intermediate values for a particular key
• Produces a set of merged output values (usually just one)
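The model above can be sketched as a single-process Python simulation (illustrative only; the real MR library distributes these phases across many machines):

```python
from collections import defaultdict

def map_reduce(map_fn, reduce_fn, inputs):
    """Single-process sketch of the MapReduce programming model.

    map_fn(in_key, in_value) returns a list of (out_key, intermediate_value);
    reduce_fn(out_key, values) returns the merged output value."""
    # Map phase: produce intermediate key/value pairs.
    intermediate = []
    for in_key, in_value in inputs:
        intermediate.extend(map_fn(in_key, in_value))
    # Shuffle phase: group intermediate values by key.
    groups = defaultdict(list)
    for out_key, value in intermediate:
        groups[out_key].append(value)
    # Reduce phase: combine all values for each key.
    return {k: reduce_fn(k, vs) for k, vs in sorted(groups.items())}
```

For example, `map_reduce(lambda k, v: [(w, 1) for w in v.split()], lambda k, vs: sum(vs), [(0, "to be or not to be")])` returns `{'be': 2, 'not': 1, 'or': 1, 'to': 2}`.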
Case 1: Word Count
Determine frequency of words in a file.

Map function (assign a value of 1 to every word):
- input is (file offset, various text)
- output is a key-value pair [(word,1)]

The MR library's shuffle step takes the map output and groups values
by key using a hash function.

Reduce function (total counts per word):
- input is (word, [1,1,1])
- output is (word, count)
Word Count – Sample Code [9]
map(String key, String value):
  // key: document name
  // value: document contents
  for each word w in value:
    EmitIntermediate(w, "1");

reduce(String key, Iterator values):
  // key: a word
  // values: a list of counts
  int result = 0;
  for each v in values:
    result += ParseInt(v);
  Emit(AsString(result));
Word Count
File 1: “i love to code”      File 2: “to code is to love”

Map tasks:
Map1 (File1): [(i,1)] [(love,1)] [(to,1)] [(code,1)]
Map2 (File2): [(to,1)] [(code,1)] [(is,1)] [(to,1)] [(love,1)]

MR library groups intermediate keys and values in “Shuffle Phase”.

Reduce tasks:
Reducer1: (code,[1,1]) -> (code,2)   (i,[1]) -> (i,1)   (is,[1]) -> (is,1)
Reducer2: (love,[1,1]) -> (love,2)   (to,[1,1,1]) -> (to,3)

Result: code,2   i,1   is,1   love,2   to,3

* File2 will emit a key-value pair of (to,2) after map when using MR Combiner functionality
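The worked example above can be reproduced with a short, self-contained Python simulation of the three phases (a sketch, not the MR library itself):

```python
from collections import defaultdict

files = {"File1": "i love to code", "File2": "to code is to love"}

# Map phase: emit (word, 1) for every word in every file.
intermediate = []
for name, text in files.items():
    for word in text.split():
        intermediate.append((word, 1))

# Shuffle phase: group the 1s by word.
groups = defaultdict(list)
for word, one in intermediate:
    groups[word].append(one)

# Reduce phase: sum the counts per word.
counts = {word: sum(ones) for word, ones in sorted(groups.items())}
print(counts)  # {'code': 2, 'i': 1, 'is': 1, 'love': 2, 'to': 3}
```

A map-side combiner would simply run the same summing step on each map task's local output before the shuffle, which is why File2 would emit (to,2) instead of two (to,1) pairs.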
MapReduce Features
• Fault Tolerance
• Redundant Execution
• Locality Optimization
• Skip Bad Records
• Sort before Reduce
• Combiner
MapReduce System Flow [8]
MapReduce Function Flow [8]
Map & Reduce Parallel Execution [8]
Case 2: Distributed Grep
Count lines across all files that match a <regex> and display the counts.

Other uses include: analyzing web server access logs to find the
top requested pages that match a given pattern

Map function (establish a match):
- input is (file offset, line)
- output is either:
      1. an empty list [] (the line does not match; in this example the pattern matches ‘A’ or ‘C’)
      2. a key-value pair [(line, 1)] (if it matches)

Reduce function (total counts):
- input is (line, [1, 1, ...])
- output is (line, n) where n is the number of 1s in the list.

            http://web.cs.wpi.edu/~cs4513/d08/OtherStuff/MapReduce-TeamC.ppt
Distributed Grep
File 1: C, B, B, C      File 2: C, A

Map tasks:
File1: (0, C) -> [(C, 1)]   (2, B) -> []   (4, B) -> []   (6, C) -> [(C, 1)]
File2: (0, C) -> [(C, 1)]   (2, A) -> [(A, 1)]

Reduce tasks:
(A, [1])       -> (A, 1)
(C, [1, 1, 1]) -> (C, 3)

Result: 3 C, 1 A
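A minimal Python sketch of the same run, assuming the pattern matches the letters ‘A’ or ‘C’ as in the example above:

```python
import re
from collections import defaultdict

pattern = re.compile(r"[AC]")     # assumed pattern for this example
file1 = ["C", "B", "B", "C"]
file2 = ["C", "A"]

# Map phase: emit (line, 1) for each matching line, nothing otherwise.
intermediate = []
for line in file1 + file2:
    if pattern.search(line):
        intermediate.append((line, 1))

# Shuffle + Reduce: total matches per distinct matching line.
groups = defaultdict(int)
for line, one in intermediate:
    groups[line] += one
print(dict(groups))  # {'C': 3, 'A': 1}
```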
Case 3: Max Speed Serve
Data analysis needed: for all professional tennis tournaments over
the past 3 years, process log files to determine the fastest serve
speed each year.

Map function (enumerate speeds for each year):
- input is (file offset, Year Speed)
- output is a key-value pair [(Year,Speed)]

Reduce function (determine max speed each year):
 - input is (Year, [speed1, … speedN])
 - output is (Year, Speed) where Speed is the fastest recorded
that year.
Max Speed Serve
File 1: 2008 136 / 2009 126 / 2009 132      File 2: 2008 134 / 2009 127 / 2010 124

Map tasks:
[(2008,136)] [(2009,126)] [(2009,132)] [(2008,134)] [(2009,127)] [(2010,124)]

Reduce tasks:
(2008,[136,134])     -> (2008,136)
(2009,[126,132,127]) -> (2009,132)
(2010,[124])         -> (2010,124)

Result: 2008 - 136, 2009 - 132, 2010 - 124

* Slower speeds for a given year are dropped at map time when using MR Combiner functionality
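The same run as a self-contained Python sketch, using the log records from the example above:

```python
from collections import defaultdict

# Log lines in "Year Speed" format, as in the example above.
log = ["2008 136", "2009 126", "2009 132",   # File 1
       "2008 134", "2009 127", "2010 124"]   # File 2

# Map phase: emit (year, speed) for every record.
intermediate = [(int(y), int(s)) for y, s in (line.split() for line in log)]

# Shuffle phase: group speeds by year.
groups = defaultdict(list)
for year, speed in intermediate:
    groups[year].append(speed)

# Reduce phase: keep the maximum speed per year.
fastest = {year: max(speeds) for year, speeds in sorted(groups.items())}
print(fastest)  # {2008: 136, 2009: 132, 2010: 124}
```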
Case 4: Word Proximity
Find occurrences of pairs of words where word1 is located within
4 words of word2.

Map function (assign a value of 1 to every match):
- input is (file offset, various text)
- output is a key-value pair [(word1|word2,1)]

Reduce function (total count per match):
- input is (word1|word2, [1,1,1])
- output is (word1|word2, count)
Word Proximity
File 1: “i have a piece of the pie”
File 2: “it is a piece of cake; it doesn’t even look like pie”

Word1 = “piece”   Word2 = “pie”

Map tasks:
(0, i have a piece of the pie) -> (piece|pie, 1)
(0, it is a piece of cake; it doesn’t even look like pie) -> ()

Reduce tasks:
(piece|pie, [1]) -> (piece|pie, 1)

Result: piece|pie, 1
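A Python sketch of this case, assuming words are split on whitespace and distance is measured by word position (both assumptions, not stated in the slide):

```python
from collections import defaultdict

w1, w2 = "piece", "pie"
lines = ["i have a piece of the pie",
         "it is a piece of cake; it doesn't even look like pie"]

# Map phase: emit ("word1|word2", 1) when the words occur within 4 words.
intermediate = []
for line in lines:
    words = line.split()
    pos1 = [i for i, w in enumerate(words) if w == w1]
    pos2 = [i for i, w in enumerate(words) if w == w2]
    for i in pos1:
        for j in pos2:
            if abs(i - j) <= 4:
                intermediate.append((w1 + "|" + w2, 1))

# Shuffle + Reduce: total matches per word pair.
counts = defaultdict(int)
for key, one in intermediate:
    counts[key] += one
print(dict(counts))  # {'piece|pie': 1}
```

In File 1, “piece” is 3 words from “pie”, so it matches; in File 2 they are 8 words apart, so nothing is emitted.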
Case 5: Reverse Web-Link Graph
Given a list of website home pages (W1…W4) and every link on
that page, point the destination sites back to the original source
web site.

Map function (reverse each link):
- input is (adjacency list in format source: dest1, dest2, ...)
- output is a key-value pair [(dest, source)]

Reduce function (create adjacency list with dest as key):
- input/output is (dest, [source1, source2, ...])
Link Reversal
Input: Adjacency List      Output: Reversed List
W1: W2,W4                  W1: W2,W4
W2: W1,W3,W4               W2: W1
W3: W4                     W3: W2,W4
W4: W1,W3                  W4: W1,W2,W3

Map tasks:
(W1,W2) -> (W2,W1)
(W1,W4) -> (W4,W1)
(W2,W1) -> (W1,W2)
(W2,W3) -> (W3,W2)
(W2,W4) -> (W4,W2)
(W3,W4) -> (W4,W3)
(W4,W1) -> (W1,W4)
(W4,W3) -> (W3,W4)

MR library groups intermediate keys and values in “Shuffle Phase”.

Reduce tasks:
(W1,[W2,W4])
(W2,[W1])
(W3,[W2,W4])
(W4,[W1,W2,W3])
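The reversal above can be reproduced with a short Python sketch over the same adjacency list:

```python
from collections import defaultdict

adjacency = {"W1": ["W2", "W4"],
             "W2": ["W1", "W3", "W4"],
             "W3": ["W4"],
             "W4": ["W1", "W3"]}

# Map phase: for every (source, dest) link emit (dest, source).
intermediate = [(dest, source)
                for source, dests in adjacency.items()
                for dest in dests]

# Shuffle + Reduce: collect all sources pointing at each destination.
reversed_links = defaultdict(list)
for dest, source in intermediate:
    reversed_links[dest].append(source)
result = {d: sorted(s) for d, s in sorted(reversed_links.items())}
print(result)
# {'W1': ['W2', 'W4'], 'W2': ['W1'], 'W3': ['W2', 'W4'], 'W4': ['W1', 'W2', 'W3']}
```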
Why Use MapReduce?

• Hides messy details of distributed infrastructure

• MapReduce simplifies programming paradigm to
  allow for easy parallel execution

• Easily scales to thousands of machines
MapReduce Jobs Run @ Google [15]

                              Aug. '04   Mar. '06   Sep. '07
Number of jobs (1000s)        29         171        2,217
Avg. completion time (secs)   634        874        395
Machine years used            217        2,002      11,081
map input data (TB)           3,288      52,254     403,152
map output data (TB)          758        6,743      34,774
reduce output data (TB)       193        2,970      14,018
Avg. machines per job         157        268        394
Unique implementations
map                           395        1958       4083
reduce                        269        1208       2418
Current Debate:
MapReduce vs Parallel DBMS
Why Not Use A Parallel DBMS?
• Parallel DBMS:
  – multiple CPUs, multiple servers
  – classic parallel programming concepts
  – HUGE established industry $$$


• Parallel DBMS Vendors
  – Teradata (NCR), DB2 (IBM), Oracle (via Exadata),
    Greenplum, Vertica, etc.
“MapReduce is a Major Step Backward”

Stonebraker & DeWitt's attack on MR (1/17/08) [10,11]
  – a step backwards in database access
  – a poor implementation
  – not novel
  – missing features
  – incompatible with DBMS tools
“Comparison of Approaches to Large-Scale Data
                 Analysis”
Stonebraker & DeWitt comparison of Hadoop MR vs. Vertica &
   DBMS-X (7/2009) [12]

   – Hadoop
      •   easy to install, get up & running
      •   maintenance of apps is harder
      •   good fault tolerance for queries
      •   slow because it reads the entire input file each time and
          pulls data to reducers in the reduce step
   – Vertica & DBMS-X
      • much faster than Hadoop because of indexes, schemas,
        column orientation, compression & “warm start-up at boot
        time”
“MapReduce and Parallel DBMSs: Friends or Foes?”

DeWitt & Stonebraker update their position (1/2010) [13]

   – Hadoop MR and Parallel DBMS are complementary

   – Use Hadoop MR for subsets of tasks

   – Use Parallel DBMS for all other applications

   – Hadoop still needs significant improvements
“MapReduce: A Flexible Data Processing Tool”
Jeffrey Dean & Sanjay Ghemawat (Google) rebuttal (1/2010) [14]


   – MR can input data from heterogeneous environments

   – MR can use indices as input

   – Useful for Complex functions

   – “Protocol Buffers” parse much faster

   – MR pull model non-negotiable

   – Addresses performance concerns
Conclusions
• Hadoop MapReduce is a solid choice for leveraging the
  power of Cloud Computing when tackling specific
  parallel data processing tasks; use a PDBMS for all
  other tasks.

• MR and PDBMS can learn from each other

• Open source Hadoop MR continues to gain ground
  on performance and efficiency

• Battle of MR vs PDBMS subsiding for now
Questions???
References
[1] http://csrc.nist.gov/groups/SNS/cloud-computing/cloud-def-v15.doc
[2] http://en.wikipedia.org/wiki/File:Cloud_computing.svg
[3] http://news.cnet.com/8301-19413_3-10140278-240.html?tag=mncol;txt
[4] http://rationalsecurity.typepad.com/blog/2009/01/cloud-computing-taxonomy-ontology-please-
     review.html
[5] http://www.opencrowd.com/views/cloud.php/2Security
[6] http://berkeleyclouds.blogspot.com
[7] Forrester Research, Talking to Your CFO About Cloud Computing, Ted Schadler; Oct. 29, 2008.
[8] http://code.google.com/edu/parallel/mapreduce-tutorial.html
[9] http://labs.google.com/papers/mapreduce.html
[10] http://databasecolumn.vertica.com/database-innovation/mapreduce-a-major-step-backwards/
[11] http://databasecolumn.vertica.com/database-innovation/mapreduce-ii/
[12] “Comparison of Approaches to Large-Scale Data Analysis”, Pavlo, Abadi, DeWitt, Stonebraker, et al.
     (7/2009)
[13] ACM, “MapReduce and Parallel DBMSs: Friends or Foes?”, Stonebraker, Abadi, DeWitt, et al. (1/2010)
[14] ACM, “MapReduce: A Flexible Data Processing Tool”, Jeffrey Dean & Sanjay Ghemawat (1/2010)
[15] http://googlesystem.blogspot.com/2008/01/google-reveals-more-mapreduce-stats.html
