2. Engineer, Not Academic
• Concurrent, Inc., Founder
• Cascading support and tools
• http://concurrentinc.com/
• Cascading, Lead Developer (started Sept 2007)
• An alternative API to MapReduce
• http://cascading.org/
• Formerly Hadoop mentoring and training
• Sun - Apple - HP - LexisNexis - startups - etc
• Formerly Systems Architect & Consultant
• Thomson/Reuters - TeleAtlas - startups - etc
Copyright Concurrent, Inc. 2011. All rights reserved.
3. Overview
• MapReduce
• Heavy Lifting
• Analytics
• Optimizations
4. MapReduce
• A “divide and conquer” strategy for parallelizing
workloads against collections of data
• Map & Reduce are two user defined functions
chained via Key Value Pairs
• It’s really Map->Group->Reduce where Group is
built in
5. Keys and Values
• Map translates input keys and values to new keys and values
  [K1,V1] Map [K2,V2]*
• System Groups each unique key with all its values
  [K2,V2] Group [K2,{V2,V2,....}]
• Reduce translates the values of each unique key to new keys and values
  [K2,{V2,V2,....}] Reduce [K3,V3]*
* = zero or more
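The three signatures above can be sketched as a tiny in-memory engine. This is only an illustration, not Hadoop's API: `run_mapreduce`, `by_colour`, `count`, and the sample records are hypothetical names and data.

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """Run Map -> Group -> Reduce in memory (hypothetical helper)."""
    groups = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in map_fn(k1, v1):        # Map: [K1,V1] -> [K2,V2]*
            groups[k2].append(v2)            # Group: [K2,V2] -> [K2,{V2,...}]
    out = []
    for k2 in sorted(groups):                # each key is reduced exactly once
        out.extend(reduce_fn(k2, groups[k2]))  # Reduce: [K2,{V2,...}] -> [K3,V3]*
    return out

# Made-up example: regroup (id, colour) records by colour and count them.
def by_colour(record_id, colour):
    yield colour, record_id

def count(colour, record_ids):
    yield colour, len(record_ids)

result = run_mapreduce([(1, "red"), (2, "blue"), (3, "red")], by_colour, count)
```

Both user functions are generators, which matches the "zero or more" outputs allowed by the `*` notation.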
6. Word Count
Mapper
  [0, "when in the course of human events"] Map ["when",1] ["in",1] ["the",1] [...,1]
  ["when",1] ["when",1] ["when",1] ["when",1] ["when",1] Group ["when",{1,1,1,1,1}]
Reducer
  ["when",{1,1,1,1,1}] Reduce ["when",5]
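A minimal word count sketch in plain Python, with the three phases written out separately. The function names and the second input line are assumptions added for illustration; a real job would use the Hadoop API instead.

```python
from itertools import groupby
from operator import itemgetter

def word_count_map(offset, line):
    # Emit [word, 1] for every word on the line; the offset key is ignored.
    for word in line.split():
        yield word, 1

def word_count_reduce(word, counts):
    # Sum the collection of 1s grouped under each unique word.
    yield word, sum(counts)

lines = [(0, "when in the course of human events"), (35, "when in doubt")]

# Map phase
mapped = [pair for offset, line in lines for pair in word_count_map(offset, line)]

# Group phase: sort by key, then collect the values of each unique key
mapped.sort(key=itemgetter(0))
grouped = [(word, [v for _, v in pairs]) for word, pairs in groupby(mapped, key=itemgetter(0))]

# Reduce phase
counts = dict(out for word, values in grouped for out in word_count_reduce(word, values))
```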
7. Divide and Conquer Parallelism
• The ‘records’ entering the Map and the ‘groups’ entering the Reduce are independent
• That is, there is no expectation of order or requirement to share state between records/groups
• So arbitrary numbers of Map and Reduce function instances can be created against arbitrary portions of input data
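As a rough illustration of that independence, the sketch below counts words over the same input divided into different numbers of splits and gets the same totals each time. The thread pool stands in for cluster tasks; `map_split`, `run`, and the sample lines are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def map_split(split):
    # Each instance sees only its own portion of the input and shares no state.
    return Counter(word for line in split for word in line.split())

lines = ["a b a", "b c", "a c c", "b"]

def run(num_splits):
    splits = [lines[i::num_splits] for i in range(num_splits)]
    with ThreadPoolExecutor(max_workers=num_splits) as pool:
        partials = list(pool.map(map_split, splits))
    total = Counter()
    for partial in partials:
        total.update(partial)        # merging partial counts is order independent
    return dict(total)
```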
8. Cluster
Cluster
Rack Rack Rack
Node Node Node Node ...
map map map map map
reduce reduce reduce
• Multiple instances of each Map and Reduce
function are distributed throughout the cluster
9. Another View
[K1,V1] Map [K2,V2] Combine Group [K2,{V2,...}] Reduce [K3,V3]
[Diagram: the input file is divided into splits (split1, split2, split3, split4, ...), each read by a Mapper Task; the Shuffle carries Mapper output to the Reducer Tasks; each Reducer Task writes one part file (part-00000, part-00001, ... part-000N) into the output directory. The tasks are labeled "same code".]
• Mappers must complete before Reducers can begin
10. Complex job assemblies
• Real applications are many MapReduce jobs chained together
• Linked by intermediate (usually temporary) files
• Executed in order, by hand, from the ‘client’ application
[Diagram: File -> Count Job (Map -> Reduce) -> File -> Sort Job (Map -> Reduce) -> File. In each job, Map reads [ k, v ] and emits [ k, v ]; Reduce reads [ k, [v] ] and emits [ k, v ].]
[ k, v ] = key and value pair
[ k, [v] ] = key and associated values collection
12. Heavy Lifting
• Things we must do because data can be heavy
• These patterns are natural to MapReduce and easy to implement
• But have some room for composition/aggregation within a Map/
Reduce (i.e., Filter + Binning)
• (leading us to think of Hadoop as an ETL framework)
• Record Filtering
• Parsing, Conversion
• Counting, Summing
• Unique
• Binning
• Distributed Tasks
13. Record Filtering
• Think unix ‘grep’
• Filtering is discarding unwanted values (or
preserving wanted)
• Only uses a Map function, no Reducer
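A map-only filter can be sketched as below, assuming line-oriented text input keyed by byte offset. `grep_map` and the sample records are hypothetical.

```python
import re

def grep_map(offset, line, pattern=re.compile(r"ERROR")):
    # Emit the record unchanged when it matches; emit nothing otherwise.
    if pattern.search(line):
        yield offset, line

records = [(0, "INFO start"), (11, "ERROR disk full"), (27, "INFO done")]
kept = [out for offset, line in records for out in grep_map(offset, line)]
```

Because no grouping is needed, the Map output can be written straight to the result files.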
14. Parsing, Conversion
• Think unix ‘sed’
• A Map function that takes an input key and/or value and
translates it into a new format
• Examples:
• raw logs to delimited text or archival efficient binary
• entity extraction
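As a sketch of the raw-logs-to-delimited-text case, the Map function below turns one access log line into tab-delimited fields. The regular expression covers only the simple line shown and is an assumption, not a complete log grammar; `LOG_LINE` and `parse_map` are hypothetical names.

```python
import re

# Assumed layout: host ident user [timestamp] "method path protocol" status bytes
LOG_LINE = re.compile(r'(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) [^"]*" (\d{3}) (\S+)')

def parse_map(offset, line):
    match = LOG_LINE.match(line)
    if match:                                  # unparseable lines are dropped
        yield None, "\t".join(match.groups())

raw = '127.0.0.1 - - [08/Aug/2008:01:00:00 +0000] "GET /foo HTTP/1.1" 200 512'
parsed = list(parse_map(0, raw))
```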
15. Counting, Summing
• The same as SQL aggregation functions
• Simply applying some function to the values
collection seen in Reduce
• Other examples:
• average, max, min, unique
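A sketch of a Reduce function applying several such aggregates to one values collection. `stats_reduce` is a hypothetical name, and holding all values in a list assumes the group fits in memory.

```python
def stats_reduce(key, values):
    values = list(values)                 # assumes the group fits in memory
    yield key, {
        "count": len(values),
        "sum": sum(values),
        "min": min(values),
        "max": max(values),
        "average": sum(values) / len(values),
        "unique": len(set(values)),
    }

(key, stats), = stats_reduce("k", [2, 4, 4, 6])
```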
16. Merging
• Where many files of the same type are converted to one
output path
• Map side merges
• One directory with as many part files as Mappers
• Reduce side merges
• Allows for removing duplicates or deleted items
• One directory with as many part files as Reducers
• Examples
• Nutch
• Normalizing log files (apache, log4j, etc)
17. Binning
• Where the values associated w/ unique keys are
persisted together
• Typically a directory path based on key’s value
• Must be conscious of total open files, remember no
appends
• Examples:
• web log files by year/month/day
• trade data by symbol
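A sketch of the year/month/day case: each record's timestamp decides the directory it is written under. The `logs/` prefix, the timestamp format, and the sample records are assumptions; in-memory lists stand in for the open output files.

```python
from collections import defaultdict
from datetime import datetime

def bin_path(timestamp):
    # Derive the output directory from the key's value.
    t = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
    return t.strftime("logs/%Y/%m/%d")

records = [
    ("2008-08-08 01:00:00", "/foo"),
    ("2008-08-08 01:01:00", "/bar"),
    ("2008-08-09 00:00:10", "/baz"),
]

bins = defaultdict(list)          # one entry per bin; a real task holds one open file each
for timestamp, url in records:
    bins[bin_path(timestamp)].append(url)
```

The number of distinct paths a single task touches is the number of files it must keep open, which is the limit the slide warns about.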
18. Distributed Tasks
• Simply where a Map or Reduce function executes some
‘task’ based on the input key and value.
• Examples:
• web crawling,
• load testing services,
• rdbms/nosql updates,
• file transfers (S3),
• image to pdf (NYT on EC2)
19. Basic Analytic Patterns
• Some of these patterns are unnatural to MapReduce
• We think in terms of columns/fields, not key value
pairs
• (leading us to think of Hadoop as an RDBMS)
• Group By
• Secondary Unique
• Unique
• CoGrouping and Joining
• Secondary Sort
20. Composite Keys/Values
[K1,V1] -> <A1,B1,C1,...>
• It is easier to think in columns/fields
• e.g. “firstname” & “lastname”, not “line”
• Whether a set of columns are Keys or Values is
arbitrary
• Keys become a means to piggyback on the properties of MR and become an implementation detail
21. Group By
GroupBy
dept_id   name
1001      Jim
          Mary
          Susan
1002      Fred
          Wilma
          Ernie
          Barny
• Group By is where Value fields are grouped by Grouping fields
• Above, Map output key is “dept_id” and value is “name”
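A sketch of the table above in code, treating rows as named fields and choosing at call time which fields act as the key. `group_by` is a hypothetical helper, not a Cascading call.

```python
from collections import defaultdict

def group_by(rows, grouping_fields, value_fields):
    """Group the value fields of each row under its grouping fields."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[f] for f in grouping_fields)    # becomes the Map output key
        value = tuple(row[f] for f in value_fields)     # becomes the Map output value
        groups[key].append(value)
    return dict(groups)

rows = [
    {"dept_id": 1001, "name": "Jim"},
    {"dept_id": 1001, "name": "Mary"},
    {"dept_id": 1001, "name": "Susan"},
    {"dept_id": 1002, "name": "Fred"},
    {"dept_id": 1002, "name": "Wilma"},
    {"dept_id": 1002, "name": "Ernie"},
    {"dept_id": 1002, "name": "Barny"},
]
by_dept = group_by(rows, ["dept_id"], ["name"])
```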
22. Group By
Mapper
  Piggyback Code: [K1,V1] -> <A1,B1,C1,D1>
  User Code: Map
  Piggyback Code: <A2,B2> -> K2, <C2,D2> -> V2
  [K2,V2]
Reducer
  [K2,{V2,V2,....}]
  Piggyback Code: [K2,V2] -> <A2,B2,{<C2,D2>,...}>
  User Code: Reduce
  Piggyback Code: <A3,B3> -> K3, <C3,D3> -> V3
  [K3,V3]
• So the K2 key becomes a composite Key of
• key: [grouping], value: [values]
23. Unique
Mapper
  [0, "when in the course of human events"] Map ["when",null] ["in",null] [...,null]
  ["when",null] ["when",null] ["when",null] ["when",null] ["when",null] Group ["when",{nulls}]
Reducer
  ["when",{nulls}] Reduce ["when",null]
• Or Distinct (as in SQL)
• Globally finding all the unique values in a dataset
• Usually finding unique values in a column
• Often used to filter a second dataset using a join
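A sketch of Unique as a Map/Reduce pair: the value is always null, so the grouping itself does the work. The function names and the second input line are made up for the example.

```python
from collections import defaultdict

def unique_map(offset, line):
    for word in line.split():
        yield word, None              # the value carries no information

def unique_reduce(word, nulls):
    yield word, None                  # one output per key, however many nulls arrived

lines = [(0, "when in the course of human events"), (35, "when in the course")]

groups = defaultdict(list)
for offset, line in lines:
    for word, null in unique_map(offset, line):
        groups[word].append(null)

distinct = [key for word in sorted(groups) for key, _ in unique_reduce(word, groups[word])]
```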
24. Secondary Sort
(group) (sorted value) (remaining value)
Date Time Url
08/08/2008, 1:00:00, http://www.example.com/foo
08/08/2008, 1:01:00, http://www.example.com/bar
08/08/2008, 1:01:30, http://www.example.com/baz
• Secondary Sorting is where
• Some Fields are grouped on, and
• Some of the remaining Fields are sorted within
their grouping
25. Secondary Sort
Mapper
  [K1,V1] -> <A1,B1,C1,D1>
  Map
  <A2,B2><C2> -> K2, <D2> -> V2
  [K2,V2]
Reducer
  [K2,{V2,V2,....}]
  [K2,V2] -> <A2,B2,{<C2,D2>,...}>
  Reduce
  <A3,B3> -> K3, <C3,D3> -> V3
  [K3,V3]
• So the K2 key becomes a composite Key of
• key: [grouping, secondary], value: [remaining values]
• The trick is to have the secondary fields piggyback on the Reduce sort, yet not be compared during the unique key comparison
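The trick can be imitated in memory with two different comparisons over the same composite key: sort on the whole key, group on its first part only. This sketch uses the Date/Time/Url example; it is an illustration of the idea, not Hadoop's comparator API.

```python
from itertools import groupby

records = [
    ("08/08/2008", "1:01:30", "http://www.example.com/baz"),
    ("08/08/2008", "1:00:00", "http://www.example.com/foo"),
    ("08/08/2008", "1:01:00", "http://www.example.com/bar"),
]

# Map: key = (grouping, secondary), value = remaining fields
mapped = [((date, time), url) for date, time, url in records]

# Sort comparison: uses the whole composite key
mapped.sort(key=lambda pair: pair[0])

# Grouping comparison: looks at the grouping field only, so the
# secondary field orders the values without splitting the group
grouped = {
    date: [(key[1], url) for key, url in pairs]
    for date, pairs in groupby(mapped, key=lambda pair: pair[0][0])
}
```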
26. Secondary Unique
Mapper
  [0, "when in the course of human events"] Map [0,"when"] [0,"in"] [0,"the"] [0,...]
  (assume Secondary Sorting magic happens here)
  [0,"in"] [0,"in"] [0,"the"] [0,"when"] [0,"when"] Group [0,{"in","in","the","when","when",...}]
Reducer
  [0,{"in","in","the","when","when",...}] Reduce ["in",null] ["the",null] ["when",null]
• Secondary Unique is where the grouping values are uniqued
• .... in a “scale free” way
• Perform a Secondary Sort...
• Reducer removes duplicates by discarding every value that
matches the previous value
• since values are now ordered, no need to maintain a Set of
values
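A sketch of the Reducer side: since the values arrive sorted, remembering only the previous value is enough to drop duplicates. `unique_sorted` is a hypothetical name.

```python
def unique_sorted(values):
    """Yield each distinct value from an already sorted stream, in constant memory."""
    previous = object()               # sentinel that equals no real value
    for value in values:
        if value != previous:
            yield value
        previous = value

uniqued = list(unique_sorted(["in", "in", "the", "when", "when"]))
```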
27. Joining
lhs data            rhs data
dept_id   name      dept_name
1001      Jim       Accounting
          Mary      Accounting
          Susan     Accounting
1002      Fred      Shipping
          Wilma     Shipping
          Ernie     Shipping
          Barny     Shipping
• Where two or more input data sets are ‘joined’ by a
common key
• Like a SQL join
28. Join Definitions
• Consider the input data [key, value]:
• LHS = [0,a] [1,b] [2,c]
• RHS = [0,A] [2,C] [3,D]
• Joins on the key:
• Inner
• [0,a,A] [2,c,C]
• Outer (Left Outer, Right Outer)
• [0,a,A] [1,b,null] [2,c,C] [3,null,D]
• Left (Left Inner, Right Outer)
• [0,a,A] [1,b,null] [2,c,C]
• Right (Left Outer, Right Inner)
• [0,a,A] [2,c,C] [3,null,D]
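The four definitions can be checked with a small sketch over the same LHS and RHS. The `join` helper and its flag names are hypothetical, and it assumes each key appears at most once per side, as in the example data.

```python
def join(lhs, rhs, keep_unmatched_lhs, keep_unmatched_rhs):
    """Join two [key, value] lists on key; missing sides become None (null)."""
    left, right = dict(lhs), dict(rhs)      # assumes unique keys per side
    out = []
    for key in sorted(left.keys() | right.keys()):
        in_left, in_right = key in left, key in right
        if (in_left and in_right) \
                or (in_left and keep_unmatched_lhs) \
                or (in_right and keep_unmatched_rhs):
            out.append((key, left.get(key), right.get(key)))
    return out

LHS = [(0, "a"), (1, "b"), (2, "c")]
RHS = [(0, "A"), (2, "C"), (3, "D")]

inner = join(LHS, RHS, False, False)
outer = join(LHS, RHS, True, True)
left_join = join(LHS, RHS, True, False)
right_join = join(LHS, RHS, False, True)
```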
29. CoGrouping
• Before Joining, CoGrouping must happen
• Simply concurrent GroupBy operations on each
input data set
30. GroupBy vs CoGroup
GroupBy (lhs data)        CoGroup (lhs data, rhs data)
dept_id   name            dept_id   name      dept_name
1001      Jim             1001      Jim       Accounting
          Mary                      Mary
          Susan                     Susan
1002      Fred            1002      Fred      Shipping
          Wilma                     Wilma
          Ernie                     Ernie
          Barny                     Barny
CoGroup: independent collections of unordered values
31. CoGroup Joined
lhs data            rhs data
dept_id   name      dept_name
1001      Jim       Accounting
          Mary      Accounting
          Susan     Accounting
1002      Fred      Shipping
          Wilma     Shipping
          Ernie     Shipping
          Barny     Shipping
• Considering the previous data, a typical Inner Join
32. CoGrouping
Mapper [n]
  [K1,V1] -> <A1,B1,C1,D1>
Mapper [n+1]
  [K1',V1'] -> <A1',B1',C1',D1'>
Map
  <A2,B2> -> K2, [n]<C2,D2> -> V2
  [K2,V2]
Reducer
  [K2,{V2,V2,....}]
  [K2,V2] -> <A2,B2,{<C2,D2,C2',D2'>,...}>
  Reduce
  <A3,B3> -> K3, <C3,D3> -> V3
  [K3,V3]
• Maps must run for each input set in same Job (n, n+1, etc)
• CoGrouping must happen against each common key
33. Joining
Reducer
  [K2,{V2,V2,....}]
  <A2,B2,{[n]<C2,D2>,[n+1]..}>
  <A2,B2,{<C2,D2>,...},{<C2',D2'>,...}>
  {<C2,D2>,...} Join {<C2',D2'>,...} -> <C2,D2,C2',D2'>
  [K2,V2] -> <A2,B2,{<C2,D2,C2',D2'>,...}>
  Reduce
  <A3,B3> -> K3, <C3,D3> -> V3
  [K3,V3]
• The CoGroups must be joined
• Finally the Reduce can be applied
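A sketch of the two steps on the department data: values are tagged with the index of the input they came from, cogrouped per key, then joined inside each group. `cogroup` and `inner_join` are hypothetical helpers, and each group is assumed to fit in memory.

```python
from collections import defaultdict

def cogroup(*inputs):
    """Collect the values of every input under each common key, one list per input."""
    groups = defaultdict(lambda: tuple([] for _ in inputs))
    for n, records in enumerate(inputs):       # n tags which input a value came from
        for key, value in records:
            groups[key][n].append(value)
    return dict(groups)

def inner_join(groups):
    """Cross the lhs and rhs collections of each key; empty sides produce nothing."""
    for key, (lhs_values, rhs_values) in sorted(groups.items()):
        for lhs_value in lhs_values:
            for rhs_value in rhs_values:
                yield key, lhs_value, rhs_value

lhs = [(1001, "Jim"), (1001, "Mary"), (1001, "Susan"),
       (1002, "Fred"), (1002, "Wilma"), (1002, "Ernie"), (1002, "Barny")]
rhs = [(1001, "Accounting"), (1002, "Shipping")]

groups = cogroup(lhs, rhs)
joined = list(inner_join(groups))
```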
34. Optimizations
• Patterns for reducing IO
• Identity Mapper
• Partial Aggregates
• Map Side Join
• Similarity Joins
• Combiners
36. Map Side Joins
• Bypasses the (immediate) need for a Reducer
• Symmetrical
• Where LHS and RHS are of equivalent size
• Requires data to be sorted on key
• Asymmetrical
• One side is small enough to fit in memory
• Typically a hashtable lookup
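A sketch of the asymmetrical case: the small side is loaded into a hashtable once, and every record of the large side is joined by lookup inside the Map. `map_side_join` is a hypothetical name, and the 1003 record is made up to show an unmatched key.

```python
def map_side_join(large_side, small_side):
    lookup = dict(small_side)              # small side held in memory by every Mapper
    for key, value in large_side:
        if key in lookup:                  # inner join; unmatched records are dropped
            yield key, value, lookup[key]

employees = [(1001, "Jim"), (1002, "Fred"), (1001, "Mary"), (1003, "Pat")]
departments = [(1001, "Accounting"), (1002, "Shipping")]

joined = list(map_side_join(employees, departments))
```

The large side needs no sorting or grouping here, which is why no Reducer is required.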
37. Combiners
Mapper
  [0, "when in the course of human events"] Map ["when",1] ["in",1] ["the",1] [...,1]
Combiner (same implementation as the Reducer)
  ["when",1] ["when",1] Group ["when",{1,1}]
  ["when",{1,1}] Reduce ["when",2]
Reducer
  ["when",2] ["when",1] ["when",2] Group ["when",{2,1,2}]
  ["when",{2,1,2}] Reduce ["when",5]
• Where Reduce runs Map side, and again Reduce side
• Only works if Reduce is commutative and associative
• Reduces bandwidth by trading CPU for IO
• Serialization/deserialization during local sorting before combining
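A sketch of the diagram in code: the same summing Reduce runs once per Mapper output as the Combiner and once more on the Reduce side. All names are hypothetical, and the three Mapper outputs are chosen to reproduce the {2,1,2} group.

```python
from collections import defaultdict

def group(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def sum_reduce(key, values):
    yield key, sum(values)          # addition is commutative and associative

def run_reduce(pairs):
    return [out for key, values in sorted(group(pairs).items())
            for out in sum_reduce(key, values)]

mapper_outputs = [
    [("when", 1), ("when", 1)],
    [("when", 1)],
    [("when", 1), ("when", 1)],
]

combined = [run_reduce(pairs) for pairs in mapper_outputs]    # Combiner, Map side
shuffled = [pair for output in combined for pair in output]   # fewer pairs cross the network
final = run_reduce(shuffled)                                  # Reducer, Reduce side

uncombined = run_reduce([pair for pairs in mapper_outputs for pair in pairs])
```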
38. Partial Aggregates
Mapper
  [0, "when in the course of human events"] Map ["when",1] ["in",1] ["the",1] [...,1]
  ["when",1] ["when",1] Partial ["when",2]
  (Provides an opportunity to promote the functionality of the next Map to this Reduce)
Reducer
  ["when",2] ["when",1] ["when",2] Group ["when",{2,1,2}]
  ["when",{2,1,2}] Reduce ["when",5]
• Supports any aggregate type, while being composable with other
aggregates
• Reduces bandwidth by trading Memory for IO
• Very important for a CPU constrained cluster
• Use a bounded LRU to keep constant memory (requires tuning)
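A sketch of a partial sum held in a bounded LRU inside the Mapper. `PartialSum`, `partial_map`, the capacity of 2, and the word list are assumptions for illustration; the cache size is the tuning knob the slide mentions.

```python
from collections import OrderedDict

class PartialSum:
    """Bounded LRU of partial sums (hypothetical helper)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.cache = OrderedDict()

    def add(self, key, value):
        """Fold value into the cached partial; return an evicted (key, partial) or None."""
        self.cache[key] = self.cache.pop(key, 0) + value   # pop + reinsert marks it most recent
        if len(self.cache) > self.capacity:
            return self.cache.popitem(last=False)          # evict the least recently used
        return None

    def flush(self):
        while self.cache:
            yield self.cache.popitem(last=False)

def partial_map(words, capacity):
    partials = PartialSum(capacity)
    for word in words:
        evicted = partials.add(word, 1)
        if evicted is not None:
            yield evicted            # emitted downstream; the Reducer finishes the sum
    yield from partials.flush()      # emit whatever is still cached at end of input

emitted = list(partial_map(["when", "in", "when", "the", "when"], capacity=2))
```

Five input words leave the Mapper as three pairs, and memory stays bounded by the capacity regardless of input size.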
39. Partial Aggregates
[Diagram: four inputs of [a,b,c,a,a,b] each pass through a "partial unique" and each emit [a,b,c,a,b]]
LRU* starting at {_,_} (*cache size of 2)
incoming value -> cache -> discarded value
a -> {a,_} -> _
b -> {b,a} -> _
c -> {c,b} -> a
a -> {a,c} -> b
a -> {a,c}
b -> {b,a} -> c
• OK that dupes emit from a Mapper and across
Mappers (or prev Reducers!)
• Final aggregation happens in Reducer
• The larger the cache, the fewer dupes
40. Tradeoffs
• CPU for IO == fault tolerance
• Memory for IO == performance
41. Similarity Join
• Compare all values LHS to values RHS to find
duplicates (or similar values)
• Naive approaches
• Cross Join (all data through one reducer)
• In-common features (very common features will
bottleneck)
42. Set-Similarity Joining
• “Efficient Parallel Set-Similarity Joins Using
MapReduce” - R Vernica, M Carey, C Li
• Only compare candidate pairs
• Candidates share uncommon features
43. [Figure: set-similarity join walkthrough on sample records, in six steps]
1: records
2: count tokens
3: order by least frequent, discard common
4: uncommon features in common
5: candidate pairs
6: final compare
• 1 and 3 share uncommon features
• thus are candidates for a full comparison
44. [Diagram: matching two sets using prefix filtering, as a chain of MapReduce jobs linked by Files]
• Tokenize Count Job (Map, Reduce, Map, Reduce)
• Join Tokens/Counts Job (Map, Reduce)
• Sort/Prefix Filter Job (Map, Reduce)
• Self Join Job (Map, Reduce)
• Unique Pairs Job (Map, Reduce)
• Join LHS Job (Map, Reduce)
• Join RHS / Match Job (Map, Reduce)
45. Duality
• Note the use of the previous patterns to route
data to implement a more efficient algorithm
46. Use a Higher Abstraction
• Command Line
• Multitool - CLI for parallel sed, grep & joins
• API
• Cascading - Java Query API and Planner
• Plume - “approximate clone of FlumeJava”
• Interactive Shell
• Cascalog - Clojure+Cascading query language (API also)
• Pig - A text Syntax
• Hive - Syntax + Infrastructure - SQL “like”
47. References
• Set Similarity
• http://www.slideshare.net/ydn/4-similarity-joinshadoopsummit2010
• http://asterix.ics.uci.edu/fuzzyjoin-mapreduce/
• MapReduce Text Processing
• http://www.umiacs.umd.edu/~jimmylin/book.html
• Plume/FlumeJava
• http://portal.acm.org/citation.cfm?id=1806596.1806638
• http://github.com/tdunning/Plume/wiki
48. I’m Hiring
• Enterprise Java server and web client
• Language design, compilers, and interpreters
• No Hadoop experience required
• More info
• http://www.concurrentinc.com/careers/
49. Resources
• Chris K Wensel
• chris@wensel.net
• @cwensel
• Cascading & Cascalog
• http://cascading.org
• @cascading
• Concurrent, Inc.
• http://concurrentinc.com
• @concurrent
50. Appendix
51. Simple Total Sorting
• Where lines in a result file should be sorted
• Must set number of reducers to 1
• Sorting in MR is local per Reduce, not global across
Reducers
52. Why Sorting Isn’t “Total”
[Diagram: Mappers emit the keys [aaa,aab,aac] and [zzx,zzy,zzz]; each key is partitioned to one of three Reducers, and each Reducer sorts only what it receives]
Reducer [aaa,zzx]
Reducer [aac,zzz]
Reducer [aab,zzy]
• Keys emitted from Map are naturally sorted at a given Reducer
• But are Partitioned to Reducers in an effectively arbitrary way (by key hash, by default)
• Thus, only one Reducer can be used for a total sort
53. Distributed Total Sort
• To work, the Shuffling phase must be modified
with:
• Custom Partitioner to partition on the
distribution of ordered Keys
• Custom Comparator for comparing Key types
• Strings work by default
54. Distributed Total Sort - Details
a ... z
ar ... ax za ... zo
ara ... ari axe ... axi zag ... zap zon ... zoo
aran aria axis zone
• Sample all K2 values and build balanced distribution for num reducers
• Sample all input keys and divide into partitions
• Write out boundaries of partitions
• Supply Partitioner that looks up partition for current K2 value
• Read boundaries into a Trie (pronounced ‘try’) data structure
• Use appropriate Comparator for Key type
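A sketch of the sampling and lookup steps. Note the substitution: it uses a binary search over the sorted boundaries instead of the Trie the slide describes, which gives the same partition for a key but is simpler to show. `sample_boundaries`, `partition`, and the seed are hypothetical.

```python
import bisect
import random

def sample_boundaries(keys, num_reducers, sample_size, seed=0):
    """Pick num_reducers - 1 split points from a sorted sample of the keys."""
    sample = sorted(random.Random(seed).sample(keys, min(sample_size, len(keys))))
    step = len(sample) / num_reducers
    return [sample[int(step * i)] for i in range(1, num_reducers)]

def partition(key, boundaries):
    """Return the Reducer index whose key range contains key."""
    return bisect.bisect_right(boundaries, key)

keys = ["zzy", "aab", "zzx", "aaa", "zzz", "aac"]
boundaries = sample_boundaries(keys, num_reducers=3, sample_size=len(keys))

reducers = [[] for _ in range(3)]
for key in keys:
    reducers[partition(key, boundaries)].append(key)

# Each Reducer sorts locally; reading the part files in order gives a total sort.
total = [key for part in reducers for key in sorted(part)]
```

Because every key in partition i sorts before every key in partition i+1, local sorting per Reducer is enough.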
Editor's Notes
- Commutativity is the ability to change the order of something without changing the end result.
- Associativity is a property that a binary operation can have. It means that, within an expression containing two or more of the same associative operators in a row, the order of operations does not matter as long as the sequence of the operands is not changed. That is, rearranging the parentheses in such an expression will not change its value.