Presentation to University of Kentucky Computer Science graduate students on high-level cloud computing, how MapReduce works, and the current competition for parallel processing on a massive scale.
1. Cloud Computing & MapReduce:
Parallel Processing on a Massive Scale
Geoff Rothman (rothman@hp.com)
March 27, 2010
2. Outline
1. Overview of Cloud Computing
– Establish a general definition
2. Overview of Google MapReduce
– Parallel programming with Cloud Computing
3. Debate between MapReduce & Parallel DBMS
– Is one better than the other or are they
complementary?
7. “SPI Model - as a Service”
• Software as a Service (SaaS):
– Application system (Salesforce, WebEx)
• Platform as a Service (PaaS):
– Infrastructure pre-existing; simply code and deploy
(Google AppEngine, MS Azure, Force.com)
• Infrastructure as a Service (IaaS):
– Raw infrastructure; servers and storage provided on-demand (Amazon Web Services, GoGrid) [3]
10. Cloud Deployment Models
• Private
– Single tenant, owned and managed by company or service provider
either on or off-premise; consumers are trusted
• Public
– Single or multitenant (shared), owned by service provider off-premise;
consumers are untrusted
• Managed
– Single or multi-tenant (shared), located in org’s datacenter but
managed and secured by Service Provider; consumers are trusted or
untrusted
• Hybrid
– Combination of public/private offering; “cloud burst”; consumers are
trusted or untrusted
11. Why use the Cloud? CFO View
• Operational vs Capital Expenditures
• Better Cash Flow
• Limited Financial Risk
• Better Balance Sheet
• Outsource non-core competencies [7]
12. Why Use the Cloud? CIO View
• Analytics
• Parallel Batch Processing
• Compute-intensive desktop apps [6]
• Mobile Interactive Apps (GUI for mashups) [6]
• Webserver uptime / redundancy
• Accelerate project rollouts
14. Cloud Computing & Parallel Batch
Processing: Overview of Map/Reduce
• Developed by Google to perform simple
computations on massive amounts of data
( > 1TB) in a substantially reduced amount of time
• Hides details for
– Parallelization
– Data distribution
– Load balancing
– Fault tolerance
15. MapReduce Programming Model [8]
Input & Output: each a set of key/value pairs
Code two functions: map & reduce
map (in_key, in_value) -> list(out_key, intermediate_value)
• Processes input key/value pair
• Produces set of intermediate pairs
reduce (out_key, list(intermediate_value)) -> list(out_value)
• Combines all intermediate values for a particular key
• Produces a set of merged output values (usually just one)
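The programming model above can be sketched as a toy, single-process driver (my own illustration, not Google's implementation): run the map function over every input pair, group the intermediate pairs by key, then run the reduce function over each group.

```python
from collections import defaultdict

def run_mapreduce(inputs, map_fn, reduce_fn):
    # Map phase: each input (key, value) pair yields a list of
    # intermediate (key, value) pairs.
    intermediate = []
    for k, v in inputs:
        intermediate.extend(map_fn(k, v))
    # Shuffle phase: group intermediate values by key.
    groups = defaultdict(list)
    for k, v in intermediate:
        groups[k].append(v)
    # Reduce phase: merge all values for each key.
    return {k: reduce_fn(k, vs) for k, vs in groups.items()}
```

For example, word count is `map_fn = lambda k, v: [(w, 1) for w in v.split()]` and `reduce_fn = lambda k, vs: sum(vs)`. The real library distributes all three phases across machines; this sketch only shows the data flow.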
16. Case 1: Word Count
Determine frequency of words in a file.
Map function (assign a value of 1 to every word):
- input is (file offset, various text)
- output is a key-value pair [(word,1)]
MR Library Shuffle Step takes Map Output and groups by Keys by
Hash function.
Reduce function (total counts per word):
- input is (word, [1,1,1])
- output is (word, count)
17. Word Count – Sample Code [9]
map(String key, String value):
  // key: document name
  // value: document contents
  for each word w in value:
    EmitIntermediate(w, "1");

reduce(String key, Iterator values):
  // key: a word
  // values: a list of counts
  int result = 0;
  for each v in values:
    result += ParseInt(v);
  Emit(AsString(result));
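A runnable Python translation of the pseudocode above (a single-process sketch; the real library distributes the map and reduce calls across machines, and the function names here are my own):

```python
from collections import defaultdict

def map_word_count(doc_name, contents):
    # Mirror of the map() pseudocode: emit (word, 1) per word.
    return [(w, 1) for w in contents.split()]

def reduce_word_count(word, counts):
    # Mirror of reduce(): sum the emitted 1s for one word.
    return sum(counts)

def word_count(docs):
    # docs: {document name: document contents}
    grouped = defaultdict(list)          # the "shuffle" step
    for name, text in docs.items():
        for word, one in map_word_count(name, text):
            grouped[word].append(one)
    return {w: reduce_word_count(w, cs) for w, cs in grouped.items()}
```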
18. Word Count Result
File 1: "i love to code"    File 2: "to code is to love"

Map tasks:
Map1 (File1): [(i,1), (love,1), (to,1), (code,1)]
Map2 (File2): [(to,1), (code,1), (is,1), (to,1), (love,1)]

The MR library groups intermediate keys and values in the "Shuffle Phase".

Reduce tasks:
Reducer1: (code, [1,1]) -> (code,2); (i, [1]) -> (i,1); (is, [1]) -> (is,1)
Reducer2: (love, [1,1]) -> (love,2); (to, [1,1,1]) -> (to,3)

Result: code,2  i,1  is,1  love,2  to,3

* File2's map output will contain the key/value pair (to,2) after map when using MR Combiner functionality
19. MapReduce Features
• Fault Tolerance
• Redundant Execution
• Locality Optimization
• Skip Bad Records
• Sort before Reduce
• Combiner
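The Combiner feature can be illustrated in a few lines: it pre-aggregates a single map task's output locally, before the shuffle, so less data crosses the network (a sketch of the idea, not Hadoop's actual Combiner API):

```python
from collections import defaultdict

def combine(map_output):
    # Pre-aggregate one map task's (word, count) pairs locally,
    # before the shuffle, so less data crosses the network.
    local = defaultdict(int)
    for word, count in map_output:
        local[word] += count
    return list(local.items())
```

Applied to File2's map output in the word count example, the two (to,1) pairs collapse into a single (to,2) pair.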
23. Case 2: Distributed Grep
Count the lines across all files that match a <regex> and display the counts. Other uses include analyzing web server access logs to find the top requested pages that match a given pattern.
Map function (establish a match):
- input is (file offset, line)
- output is either:
1. an empty list [] (the line does not match the pattern)
2. a key-value pair [(line, 1)] (if it matches)
Reduce function (total counts):
- input is (line, [1, 1, ...])
- output is (line, n) where n is the number of 1s in the list.
http://web.cs.wpi.edu/~cs4513/d08/OtherStuff/MapReduce-TeamC.ppt
24. Distributed Grep
File 1 lines: C, B, B, C    File 2 lines: C, A    Result: C,3  A,1

Map tasks:
File1: (0, C) -> [(C, 1)]; (2, B) -> []; (4, B) -> []; (6, C) -> [(C, 1)]
File2: (0, C) -> [(C, 1)]; (2, A) -> [(A, 1)]

Reduce tasks:
(A, [1]) -> (A, 1)
(C, [1, 1, 1]) -> (C, 3)
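A minimal single-process sketch of this grep job in Python (file offsets are simplified to line indices, and the function names are my own):

```python
import re
from collections import defaultdict

def map_grep(offset, line, pattern):
    # Emit (line, 1) when the line matches the pattern, else nothing.
    return [(line, 1)] if re.search(pattern, line) else []

def distributed_grep(files, pattern):
    # files: list of files, each given as a list of lines.
    groups = defaultdict(list)
    for lines in files:
        for offset, line in enumerate(lines):
            for key, one in map_grep(offset, line, pattern):
                groups[key].append(one)
    # Reduce: total matches per distinct matching line.
    return {k: sum(ones) for k, ones in groups.items()}
```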
25. Case 3: Max Speed Serve
- Data analysis needed: for all professional tennis tournaments over the past 3 years, process log files to determine the fastest serve speed each year.
Map function (enumerate speeds for each year):
- input is (file offset, "Year Speed")
- output is a key-value pair [(Year, Speed)]
Reduce function (determine max speed each year):
- input is (Year, [speed1, ..., speedN])
- output is (Year, Speed) where Speed is the fastest recorded that year.
26. Max Speed Serve
File 1: 2008 136 / 2009 126 / 2009 132    File 2: 2008 134 / 2009 127 / 2010 124
Result: 2008-136, 2009-132, 2010-124

Map tasks:
[(2008,136)], [(2009,126)], [(2009,132)], [(2008,134)], [(2009,127)], [(2010,124)]

Reduce tasks:
(2008, [136, 134]) -> (2008,136)
(2009, [126, 132, 127]) -> (2009,132)
(2010, [124]) -> (2010,124)

* The (2009,126) value will be dropped before the shuffle when using MR Combiner functionality
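The same flow sketched in Python (single-process, using the "Year Speed" record format from the slide; names are my own):

```python
from collections import defaultdict

def map_serve(offset, record):
    # record is a "Year Speed" log line: emit (year, speed).
    year, speed = record.split()
    return [(year, int(speed))]

def max_serve(files):
    groups = defaultdict(list)
    for lines in files:
        for offset, record in enumerate(lines):
            for year, speed in map_serve(offset, record):
                groups[year].append(speed)
    # Reduce: fastest serve per year. A combiner would take the
    # local max of each map task's output first.
    return {year: max(speeds) for year, speeds in groups.items()}
```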
27. Case 4: Word Proximity
Find occurrences of pairs of words where word1 is located within
4 words of word2.
Map function (assign a value of 1 to every match):
- input is (file offset, various text)
- output is a key-value pair [(word1|word2,1)]
Reduce function (total count per match):
- input is (word1|word2, [1,1,1])
- output is (word1|word2, count)
28. Word Proximity
File 1: "i have a piece of the pie"    File 2: "it is a piece of cake; it doesn't even look like pie"
Word1 = "piece"  Word2 = "pie"    Result: piece|pie,1

Map tasks:
(0, i have a piece of the pie) -> [(piece|pie,1)]
(0, it is a piece of cake; it doesn't even look like pie) -> []

Reduce tasks:
(piece|pie, [1]) -> (piece|pie,1)
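A sketch of the proximity map and reduce functions in Python (the 4-word window and the `word1|word2` key follow the slide; the tokenization is deliberately naive and my own):

```python
def map_proximity(offset, text, word1, word2, window=4):
    # Emit (word1|word2, 1) when word2 appears within `window`
    # words of an occurrence of word1.
    words = text.replace(";", " ").split()
    out = []
    for i, w in enumerate(words):
        if w == word1 and word2 in words[max(0, i - window): i + window + 1]:
            out.append((word1 + "|" + word2, 1))
    return out

def reduce_proximity(key, ones):
    # Total matches for one word pair.
    return sum(ones)
```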
29. Case 5: Reverse Web-Link Graph
Given a list of website home pages (W1…W4) and every link on
that page, point the destination sites back to the original source
web site.
Map function (invert each link):
- input is (adjacency list in format source: dest1, dest2, ...)
- output is a key-value pair [(dest, source)]
Reduce function (create adjacency list with dest as key):
- input/output is (dest, [source1, source2, ...])
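In Python this inversion is a one-line map, and the reduce step just collects the grouped sources (a single-process sketch with my own function names):

```python
from collections import defaultdict

def map_links(source, destinations):
    # Invert every edge: (source -> dest) becomes (dest, source).
    return [(dest, source) for dest in destinations]

def reverse_link_graph(adjacency):
    # adjacency: {source page: [destination pages it links to]}
    groups = defaultdict(list)
    for source, dests in adjacency.items():
        for dest, src in map_links(source, dests):
            groups[dest].append(src)
    # Reduce is the identity here: emit (dest, [sources]).
    return {dest: sorted(srcs) for dest, srcs in groups.items()}
```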
31. Why Use MapReduce?
• Hides messy details of distributed infrastructure
• MapReduce simplifies programming paradigm to
allow for easy parallel execution
• Easily scales to thousands of machines
32. MapReduce Jobs Run @ Google [15]
Aug. '04 Mar. '06 Sep. '07
Number of jobs (1000s) 29 171 2,217
Avg. completion time (secs) 634 874 395
Machine years used 217 2,002 11,081
map input data (TB) 3,288 52,254 403,152
map output data (TB) 758 6,743 34,774
reduce output data (TB) 193 2,970 14,018
Avg. machines per job 157 268 394
Unique implementations
map 395 1958 4083
reduce 269 1208 2418
34. Why Not Use A Parallel DBMS?
• Parallel DBMS:
– multiple CPUs, multiple servers
– classic parallel programming concepts
– HUGE established industry $$$
• Parallel DBMS Vendors
– Teradata (NCR), DB2 (IBM), Oracle (via exadata),
Greenplum, Vertica etc.
35. “MapReduce is a Major Step Backward”
Stonebraker & DeWitt attack on MR (1/17/08) [10,11]
– a step backwards in database access
– a poor implementation
– not novel
– missing features
– incompatible with DBMS tools
36. “Comparison of Approaches to Large-Scale Data
Analysis”
Stonebraker & DeWitt comparison of Hadoop MR vs. Vertica & DBMS-X (7/2009) [12]
– Hadoop
• easy to install, get up & running
• Maintaining apps is harder
• Good fault tolerance within queries
• Slower because it re-reads the entire input on each query and pulls files across the network during the reduce step
– Vertica & DBMS-X
• much faster than Hadoop because of indexes, schema,
column orientation, compression & “warm start-up at boot
time”.
37. “MapReduce and Parallel DBMSs: Friends or Foes?”
DeWitt & Stonebraker update their position (1/2010) [13]
– Hadoop MR and Parallel DBMS are complementary
– Use Hadoop MR for subsets of tasks
– Use Parallel DBMS for all other applications
– Hadoop still needs significant improvements
38. “MapReduce: A Flexible Data Processing Tool”
Jeffrey Dean & Sanjay Ghemawat (Google) rebuttal (1/2010) [14]
– MR can input data from heterogeneous environments
– MR can use indices as input to MR
– Useful for Complex functions
– “Protocol Buffers” parse much faster
– MR pull model non-negotiable
– Addresses performance concerns
39. Conclusions
• Hadoop MapReduce is a solid choice for leveraging the power of Cloud Computing when tackling specific parallel data processing tasks; use a PDBMS for all other tasks.
• MR and PDBMS can learn from each other
• Open source Hadoop MR continues to gain ground
on performance and efficiency
• Battle of MR vs PDBMS subsiding for now