Presentation to University of Kentucky Computer Science graduate students on high-level cloud computing, how MapReduce works, and the current competition for parallel processing on a massive scale.
1. Cloud Computing & MapReduce:
Parallel Processing on a Massive Scale
Geoff Rothman (rothman@hp.com)
March 27, 2010
2. Outline
1. Overview of Cloud Computing
– Establish a general definition
2. Overview of Google MapReduce
– Parallel programming with Cloud Computing
3. Debate between MapReduce & Parallel DBMS
– Is one better than the other or are they
complementary?
7. “SPI Model - as a Service”
• Software as a Service (SaaS):
– Application system (Salesforce, WebEx)
• Platform as a Service (PaaS):
– Infrastructure pre-existing; simply code and deploy
(Google AppEngine, MS Azure, Force.com)
• Infrastructure as a Service (IaaS):
– Raw infrastructure; servers and storage provided on-demand (Amazon Web Services, GoGrid) [3]
10. Cloud Deployment Models
• Private
– Single tenant, owned and managed by company or service provider
either on or off-premise; consumers are trusted
• Public
– Single or multitenant (shared), owned by service provider off-premise;
consumers are untrusted
• Managed
– Single or multi-tenant (shared), located in org’s datacenter but
managed and secured by Service Provider; consumers are trusted or
untrusted
• Hybrid
– Combination of public/private offering; “cloud burst”; consumers are
trusted or untrusted
11. Why use the Cloud? CFO View
• Operational vs Capital Expenditures
• Better Cash Flow
• Limited Financial Risk
• Better Balance Sheet
• Outsource non-core competencies [7]
12. Why Use the Cloud? CIO View
• Analytics
• Parallel Batch Processing
• Compute-intensive desktop apps [6]
• Mobile Interactive Apps (GUI for mashups) [6]
• Webserver uptime / redundancy
• Accelerate project rollouts
14. Cloud Computing & Parallel Batch
Processing: Overview of Map/Reduce
• Developed by Google to perform simple
computations on massive amounts of data
( > 1TB) in a substantially reduced amount of time
• Hides details for
– Parallelization
– Data distribution
– Load balancing
– Fault tolerance
15. MapReduce Programming Model [8]
Input & Output: each a set of key/value pairs
Code two functions: map & reduce
map (in_key, in_value) -> list(out_key, intermediate_value)
• Processes input key/value pair
• Produces set of intermediate pairs
reduce (out_key, list(intermediate_value)) -> list(out_value)
• Combines all intermediate values for a particular key
• Produces a set of merged output values (usually just one)
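The programming model above can be sketched as a toy, single-process driver (my own illustration, not Google's implementation): run the map function over every input pair, group the intermediate pairs by key, then run the reduce function over each group.

```python
from collections import defaultdict

def run_mapreduce(inputs, map_fn, reduce_fn):
    # Map phase: each input (key, value) pair yields a list of
    # intermediate (key, value) pairs.
    intermediate = []
    for k, v in inputs:
        intermediate.extend(map_fn(k, v))
    # Shuffle phase: group intermediate values by key.
    groups = defaultdict(list)
    for k, v in intermediate:
        groups[k].append(v)
    # Reduce phase: merge all values for each key.
    return {k: reduce_fn(k, vs) for k, vs in groups.items()}
```

For example, word count is `map_fn = lambda k, v: [(w, 1) for w in v.split()]` and `reduce_fn = lambda k, vs: sum(vs)`. The real library distributes all three phases across machines; this sketch only shows the data flow.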
16. Case 1: Word Count
Determine frequency of words in a file.
Map function (assign a value of 1 to every word):
- input is (file offset, various text)
- output is a key-value pair [(word,1)]
MR Library Shuffle Step takes Map Output and groups by Keys by
Hash function.
Reduce function (total counts per word):
- input is (word, [1,1,1])
- output is (word, count)
17. Word Count – Sample Code [9]
map(String key, String value):
  // key: document name
  // value: document contents
  for each word w in value:
    EmitIntermediate(w, "1");

reduce(String key, Iterator values):
  // key: a word
  // values: a list of counts
  int result = 0;
  for each v in values:
    result += ParseInt(v);
  Emit(AsString(result));
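A runnable Python translation of the pseudocode above (a single-process sketch; the real library distributes the map and reduce calls across machines, and the function names here are my own):

```python
from collections import defaultdict

def map_word_count(doc_name, contents):
    # Mirror of the map() pseudocode: emit (word, 1) per word.
    return [(w, 1) for w in contents.split()]

def reduce_word_count(word, counts):
    # Mirror of reduce(): sum the emitted 1s for one word.
    return sum(counts)

def word_count(docs):
    # docs: {document name: document contents}
    grouped = defaultdict(list)          # the "shuffle" step
    for name, text in docs.items():
        for word, one in map_word_count(name, text):
            grouped[word].append(one)
    return {w: reduce_word_count(w, cs) for w, cs in grouped.items()}
```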
18. Word Count Result
File 1: "i love to code"    File 2: "to code is to love"

Map tasks:
Map1 (File1): [(i,1), (love,1), (to,1), (code,1)]
Map2 (File2): [(to,1), (code,1), (is,1), (to,1), (love,1)]

The MR library groups intermediate keys and values in the "Shuffle Phase".

Reduce tasks:
Reducer1: (code, [1,1]) -> (code,2); (i, [1]) -> (i,1); (is, [1]) -> (is,1)
Reducer2: (love, [1,1]) -> (love,2); (to, [1,1,1]) -> (to,3)

Result: code,2  i,1  is,1  love,2  to,3

* File2's map output will contain the key/value pair (to,2) after map when using MR Combiner functionality
19. MapReduce Features
• Fault Tolerance
• Redundant Execution
• Locality Optimization
• Skip Bad Records
• Sort before Reduce
• Combiner
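The Combiner feature can be illustrated in a few lines: it pre-aggregates a single map task's output locally, before the shuffle, so less data crosses the network (a sketch of the idea, not Hadoop's actual Combiner API):

```python
from collections import defaultdict

def combine(map_output):
    # Pre-aggregate one map task's (word, count) pairs locally,
    # before the shuffle, so less data crosses the network.
    local = defaultdict(int)
    for word, count in map_output:
        local[word] += count
    return list(local.items())
```

Applied to File2's map output in the word count example, the two (to,1) pairs collapse into a single (to,2) pair.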
23. Case 2: Distributed Grep
Count the lines across all files that match a <regex> and display the counts. Other uses include analyzing web server access logs to find the top requested pages that match a given pattern.
Map function (establish a match):
- input is (file offset, line)
- output is either:
1. an empty list [] (the line does not match the pattern)
2. a key-value pair [(line, 1)] (if it matches)
Reduce function (total counts):
- input is (line, [1, 1, ...])
- output is (line, n) where n is the number of 1s in the list.
http://web.cs.wpi.edu/~cs4513/d08/OtherStuff/MapReduce-TeamC.ppt
24. Distributed Grep
File 1 lines: C, B, B, C    File 2 lines: C, A    Result: C,3  A,1

Map tasks:
File1: (0, C) -> [(C, 1)]; (2, B) -> []; (4, B) -> []; (6, C) -> [(C, 1)]
File2: (0, C) -> [(C, 1)]; (2, A) -> [(A, 1)]

Reduce tasks:
(A, [1]) -> (A, 1)
(C, [1, 1, 1]) -> (C, 3)
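A minimal single-process sketch of this grep job in Python (file offsets are simplified to line indices, and the function names are my own):

```python
import re
from collections import defaultdict

def map_grep(offset, line, pattern):
    # Emit (line, 1) when the line matches the pattern, else nothing.
    return [(line, 1)] if re.search(pattern, line) else []

def distributed_grep(files, pattern):
    # files: list of files, each given as a list of lines.
    groups = defaultdict(list)
    for lines in files:
        for offset, line in enumerate(lines):
            for key, one in map_grep(offset, line, pattern):
                groups[key].append(one)
    # Reduce: total matches per distinct matching line.
    return {k: sum(ones) for k, ones in groups.items()}
```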
25. Case 3: Max Speed Serve
- Data analysis needed: for all professional tennis tournaments over the past 3 years, process log files to determine the fastest serve speed each year.
Map function (enumerate speeds for each year):
- input is (file offset, "Year Speed")
- output is a key-value pair [(Year, Speed)]
Reduce function (determine max speed each year):
- input is (Year, [speed1, ..., speedN])
- output is (Year, Speed) where Speed is the fastest recorded that year.
26. Max Speed Serve
File 1: 2008 136 / 2009 126 / 2009 132    File 2: 2008 134 / 2009 127 / 2010 124
Result: 2008-136, 2009-132, 2010-124

Map tasks:
[(2008,136)], [(2009,126)], [(2009,132)], [(2008,134)], [(2009,127)], [(2010,124)]

Reduce tasks:
(2008, [136, 134]) -> (2008,136)
(2009, [126, 132, 127]) -> (2009,132)
(2010, [124]) -> (2010,124)

* The (2009,126) value will be dropped before the shuffle when using MR Combiner functionality
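The same flow sketched in Python (single-process, using the "Year Speed" record format from the slide; names are my own):

```python
from collections import defaultdict

def map_serve(offset, record):
    # record is a "Year Speed" log line: emit (year, speed).
    year, speed = record.split()
    return [(year, int(speed))]

def max_serve(files):
    groups = defaultdict(list)
    for lines in files:
        for offset, record in enumerate(lines):
            for year, speed in map_serve(offset, record):
                groups[year].append(speed)
    # Reduce: fastest serve per year. A combiner would take the
    # local max of each map task's output first.
    return {year: max(speeds) for year, speeds in groups.items()}
```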
27. Case 4: Word Proximity
Find occurrences of pairs of words where word1 is located within
4 words of word2.
Map function (assign a value of 1 to every match):
- input is (file offset, various text)
- output is a key-value pair [(word1|word2,1)]
Reduce function (total count per match):
- input is (word1|word2, [1,1,1])
- output is (word1|word2, count)
28. Word Proximity
File 1: "i have a piece of the pie"    File 2: "it is a piece of cake; it doesn't even look like pie"
Word1 = "piece"  Word2 = "pie"    Result: piece|pie,1

Map tasks:
(0, i have a piece of the pie) -> [(piece|pie,1)]
(0, it is a piece of cake; it doesn't even look like pie) -> []

Reduce tasks:
(piece|pie, [1]) -> (piece|pie,1)
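A sketch of the proximity map and reduce functions in Python (the 4-word window and the `word1|word2` key follow the slide; the tokenization is deliberately naive and my own):

```python
def map_proximity(offset, text, word1, word2, window=4):
    # Emit (word1|word2, 1) when word2 appears within `window`
    # words of an occurrence of word1.
    words = text.replace(";", " ").split()
    out = []
    for i, w in enumerate(words):
        if w == word1 and word2 in words[max(0, i - window): i + window + 1]:
            out.append((word1 + "|" + word2, 1))
    return out

def reduce_proximity(key, ones):
    # Total matches for one word pair.
    return sum(ones)
```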
29. Case 5: Reverse Web-Link Graph
Given a list of website home pages (W1…W4) and every link on
that page, point the destination sites back to the original source
web site.
Map function (invert each link):
- input is (adjacency list in format source: dest1, dest2, ...)
- output is a key-value pair [(dest, source)]
Reduce function (create adjacency list with dest as key):
- input/output is (dest, [source1, source2, ...])
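In Python this inversion is a one-line map, and the reduce step just collects the grouped sources (a single-process sketch with my own function names):

```python
from collections import defaultdict

def map_links(source, destinations):
    # Invert every edge: (source -> dest) becomes (dest, source).
    return [(dest, source) for dest in destinations]

def reverse_link_graph(adjacency):
    # adjacency: {source page: [destination pages it links to]}
    groups = defaultdict(list)
    for source, dests in adjacency.items():
        for dest, src in map_links(source, dests):
            groups[dest].append(src)
    # Reduce is the identity here: emit (dest, [sources]).
    return {dest: sorted(srcs) for dest, srcs in groups.items()}
```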
31. Why Use MapReduce?
• Hides messy details of distributed infrastructure
• MapReduce simplifies programming paradigm to
allow for easy parallel execution
• Easily scales to thousands of machines
32. MapReduce Jobs Run @ Google [15]
Aug. '04 Mar. '06 Sep. '07
Number of jobs (1000s) 29 171 2,217
Avg. completion time (secs) 634 874 395
Machine years used 217 2,002 11,081
map input data (TB) 3,288 52,254 403,152
map output data (TB) 758 6,743 34,774
reduce output data (TB) 193 2,970 14,018
Avg. machines per job 157 268 394
Unique implementations
map 395 1958 4083
reduce 269 1208 2418
34. Why Not Use A Parallel DBMS?
• Parallel DBMS:
– multiple CPUs, multiple servers
– classic parallel programming concepts
– HUGE established industry $$$
• Parallel DBMS Vendors
– Teradata (NCR), DB2 (IBM), Oracle (via exadata),
Greenplum, Vertica etc.
35. “MapReduce is a Major Step Backward”
Stonebraker & DeWitt attack on MR (1/17/08) [10,11]
– a step backwards in database access
– a poor implementation
– not novel
– missing features
– incompatible with DBMS tools
36. “Comparison of Approaches to Large-Scale Data
Analysis”
Stonebraker & DeWitt comparison of Hadoop MR vs. Vertica & DBMS-X (7/2009) [12]
– Hadoop
• easy to install, get up & running
• Maintaining apps is harder
• Good fault tolerance within queries
• Slower because it re-reads the entire input on each query and pulls files across the network during the reduce step
– Vertica & DBMS-X
• much faster than Hadoop because of indexes, schema,
column orientation, compression & “warm start-up at boot
time”.
37. “MapReduce and Parallel DBMSs: Friends or Foes?”
DeWitt & Stonebraker update their position (1/2010) [13]
– Hadoop MR and Parallel DBMS are complementary
– Use Hadoop MR for subsets of tasks
– Use Parallel DBMS for all other applications
– Hadoop still needs significant improvements
38. “MapReduce: A Flexible Data Processing Tool”
Jeffrey Dean & Sanjay Ghemawat (Google) rebuttal (1/2010) [14]
– MR can input data from heterogeneous environments
– MR can use indices as input to MR
– Useful for Complex functions
– “Protocol Buffers” parse much faster
– MR pull model non-negotiable
– Addresses performance concerns
39. Conclusions
• Hadoop MapReduce is a solid choice for leveraging the power of Cloud Computing when tackling specific parallel data processing tasks; use a PDBMS for all other tasks.
• MR and PDBMS can learn from each other
• Open source Hadoop MR continues to gain ground
on performance and efficiency
• Battle of MR vs PDBMS subsiding for now