My name is Neta Barkay, and I'm a data scientist at LivePerson.
I'd like to share a talk I presented at the Underscore Scala community on "Efficient MapReduce using Scalding".
In this talk I reviewed why Scalding is a good fit for big data analysis and how it enables writing quick, intuitive code with the full functionality of vanilla MapReduce, without compromising on efficient execution on the Hadoop cluster. In addition, I presented some example Scalding jobs you can use to get started, and talked about Scalding's ecosystem, which includes Cascading and the monoids from the Algebird library.
Read more & Video: https://connect.liveperson.com/community/developers/blog/2014/02/25/scalding-reaching-efficient-mapreduce
2. Outline
Scalding – a Scala library that makes it easy to write MapReduce jobs on Hadoop.
We will talk about:
• MapReduce paradigm
• Writing Scalding jobs
• Improving job performance
• Typed API, testing
3. Getting a glimpse of some Scalding code

class TopKJob(args : Args) extends Job(args) {
  val exclusions = Tsv(args("exclusions"), 'exVisitorId)
  Tsv(args("input"), visitScheme)
    .filter('country){ country : String => country == "Israel" }
    .leftJoinWithTiny('visitorId -> 'exVisitorId, exclusions)
    .filter('exVisitorId){ isEx : String => isEx == null }
    .groupBy('section){ _.sortWithTake(visitScheme -> 'top, k)(biggerSale) }
    .flattenTo[visitType]{ 'top -> visitScheme }
    .write(Tsv(args("output"), visitScheme))
}
4. Asking big data questions
Which questions will you ask? What analysis will you do?
A possible approach: use the outliers to improve your product.
• Most popular products on your site
• Visits that ended with the highest sale value

5. Asking big data questions
Both are instances of the same problem: finding the top elements in the data.
6. Data analysis problem
Top elements problem
Input:
• Data – arranged in records
• K – number of top elements, or p – percentage of top elements to output
• Order function – some ordering on the records
Output:
• The K top records of our data, or the top p percent, according to the order function
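The definition above can be sketched in a few lines of plain Scala on in-memory data (a toy illustration only; `topK` and `topP` are our own hypothetical helpers, not Scalding API):

```scala
// Toy illustration of the top-elements problem on in-memory data.
// `topK` and `topP` are hypothetical helpers, not part of Scalding.
object TopElements {
  // The K top records according to the order function
  def topK[T](records: Seq[T], k: Int)(implicit ord: Ordering[T]): Seq[T] =
    records.sorted(ord.reverse).take(k)

  // The top fraction p of the records (e.g. p = 0.25)
  def topP[T](records: Seq[T], p: Double)(implicit ord: Ordering[T]): Seq[T] =
    topK(records, (records.size * p).toInt)
}
```

For example, `TopElements.topK(Seq(13, 55, 8, 2, 34, 89, 21, 8), 5)` returns `Seq(89, 55, 34, 21, 13)` – the same input and output used in the flow example later in the talk.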
7. Algorithm flow
Top K elements problem:
Read input records → Sort records, take top K → Output top records
Input = 13, 55, 8, 2, 34, 89, 21, 8
K = 5
Output = 89, 55, 34, 21, 13
8. Algorithm flow
Scalding code for the flow above (input = 13, 55, 8, 2, 34, 89, 21, 8, K = 5):

Tsv(args("input"), 'item)
  .groupAll{ _.sortWithTake('item -> 'top, 5){
    (a : Int, b : Int) => a > b } }
  .write(Tsv(args("output"), 'top))
10. Algorithm flow
Top K elements problem:
Read input records → Filter records that fit the target population → Sort records, take top K → Output top records
11. Algorithm flow
Top K elements problem:
Read input records → Filter records that fit the target population → Divide into groups by site section → (per group) Sort records, take top K → Output top records
12. Algorithm flow
Top K elements problem:
Read input records, and read the exclusion list from an external source → Filter records that fit the target population → Filter out the visits from the exclusion list according to visitor id → Divide into groups by site section → (per group) Sort records, take top K → Output top records
14. MapReduce on Hadoop
Each HDFS block feeds a mapper; the mapper output is shuffled to the reducers, and this traffic between the mappers and the reducers is the big bottleneck.
Block 1 → Mapper 1: (k,v) → (k'1,v'1), (k'2,v'2)…
Block n → Mapper n: (k,v) → (k'1,v'1), (k'2,v'2)…
Reducer: (k', iterator(v')) → v''1, v''2… → output file
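The data flow in the diagram can be mimicked in-memory to fix the signatures: a mapper turns each record into (k', v') pairs, the shuffle groups the pairs by key (the traffic bottleneck above), and a reducer folds each key's values. A toy plain-Scala sketch (our own illustration, not how Hadoop is implemented):

```scala
// In-memory mimic of the MapReduce data flow in the diagram above.
// mapper: record => Seq[(K2, V2)]; reducer: (K2, values) => V3.
object MiniMapReduce {
  def run[V1, K2, V2, V3](blocks: Seq[Seq[V1]])
                         (mapper: V1 => Seq[(K2, V2)])
                         (reducer: (K2, Seq[V2]) => V3): Map[K2, V3] =
    blocks.flatMap(_.flatMap(mapper))   // map phase: one mapper per block
      .groupBy(_._1)                    // the shuffle: all mapper-to-reducer traffic
      .map { case (k, kvs) => k -> reducer(k, kvs.map(_._2)) }  // reduce phase
}
```

Word count as a usage example: `MiniMapReduce.run(Seq(Seq("a b"), Seq("b b")))(line => line.split(" ").toSeq.map(w => (w, 1)))((_, vs) => vs.sum)` yields `Map("a" -> 1, "b" -> 3)`.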
15. Efficient MapReduce
Which tool should we use? We want all of:
• Fast code writing
• Full functionality
• Easy to alter and easy maintenance
• Efficient execution, with built-in performance-oriented features
16. About Scalding
Scalding is a Scala library that makes it easy to write
MapReduce jobs in Hadoop. It's similar to other
MapReduce platforms like Pig and Hive, but offers a
higher level of abstraction by leveraging the full power of
Scala and the JVM
–Twitter
17. Algorithm flow
Top K elements problem:
Read input records, and read the exclusion list from an external source → Filter records that fit the target population → Filter out the visits from the exclusion list according to visitor id → Divide into groups by site section → (per group) Sort records, take top K → Output top records
24. MapReduce joins
We'd like to filter out the visits that appear in the exclusion list:

visits:
visitorId  country  section  saleValue
1          Israel   …        …
2          Israel   …        …
3          Israel   …        …

exclusion list:
exVisitorId
3
1

26. MapReduce joins
Left join of the visits with the exclusion list on visitorId = exVisitorId:

visitorId  country  section  saleValue  exVisitorId
1          Israel   …        …          1
2          Israel   …        …          null
3          Israel   …        …          3

27. MapReduce joins
Filtering the joined records to those with exVisitorId == null leaves only the visits that are not in the exclusion list:

visitorId  country  section  saleValue  exVisitorId
2          Israel   …        …          null
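The join-then-filter semantics walked through in these tables can be mimicked with plain Scala collections (an illustration of the semantics only, not of how Scalding executes the join; `Visit` and `filterExcluded` are our own names):

```scala
// Toy model of "left join with the exclusion list, then keep the rows
// whose exVisitorId is null": None plays the role of null here.
case class Visit(visitorId: Int, country: String)

object ExclusionJoin {
  def filterExcluded(visits: Seq[Visit], exclusions: Set[Int]): Seq[Visit] =
    visits
      // left join: every visit is kept, matched against the tiny side
      .map(v => (v, if (exclusions(v.visitorId)) Some(v.visitorId) else None))
      // keep only the unmatched visits (the "null" rows)
      .filter { case (_, ex) => ex.isEmpty }
      .map(_._1)
}
```

With the tables above, `filterExcluded(Seq(Visit(1, "Israel"), Visit(2, "Israel"), Visit(3, "Israel")), Set(3, 1))` keeps only visitor 2.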
33. Efficient MapReduce
MapReduce performance issues:
1. Traffic bottleneck between the mappers and the reducers.
In our job, the traffic bottleneck is where we take the top K elements.
• We'd like each mapper to output only the top elements of its own input.
• How is sortWithTake implemented?
34. Efficient performance using Algebird
sortWithTake uses:

class PriorityQueueMonoid[T](max : Int)(implicit ord : Ordering[T])
  extends Monoid[PriorityQueue[T]]

Defined in Algebird (Twitter): abstract algebra for Scala, targeted at building aggregation systems.
35. Efficient performance using Algebird
sortWithTake uses:

class PriorityQueueMonoid[T](max : Int)(implicit ord : Ordering[T])
  extends Monoid[PriorityQueue[T]]

In the PriorityQueue case:
• Zero is the empty PriorityQueue.
• Two PriorityQueues can be added, keeping only the top K elements, e.g. for K=5:
Q1: values = 55, 34, 21, 13, 8
Q2: values = 100, 80, 60, 40, 20
Q1 plus Q2: values = 100, 80, 60, 55, 40
• The plus operation is associative and commutative.
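The merge in the example above can be written down as a plain-Scala sketch of the monoid's plus (illustration only; Algebird's real `PriorityQueueMonoid` operates on `java.util.PriorityQueue`s and is implemented differently):

```scala
// Sketch of the "plus" of a bounded top-K monoid: merge two summaries
// and keep only the K largest elements. Associative and commutative.
object TopKMerge {
  val K = 5
  def zero: List[Int] = Nil
  def plus(q1: List[Int], q2: List[Int]): List[Int] =
    (q1 ++ q2).sorted(Ordering[Int].reverse).take(K)
}
```

Merging Q1 = 55, 34, 21, 13, 8 with Q2 = 100, 80, 60, 40, 20 yields 100, 80, 60, 55, 40, as on the slide; because plus is associative and commutative, partial merges can happen in any order and on any machine.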
36. Efficient performance using Algebird
All Monoid aggregations can start in the Map phase and then finish in the Reduce phase. This decreases the amount of traffic from the mappers to the reducers.
This is performed implicitly when using Scalding's built-in aggregation functions:
average, sum, sizeAveStdev, histogram, approximateUniqueCount, sortWithTake
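The map-side aggregation described above can be sketched as follows: each mapper pre-combines its own block into a K-sized summary, so at most K records per mapper cross the network (a plain-Scala illustration with hypothetical names, not Scalding internals):

```scala
// Map-side combine sketch: each "mapper" first reduces its block to a
// top-K summary; the "reducer" then merges the small summaries.
object MapSideTopK {
  val K = 5
  private def topK(xs: Seq[Int]): Seq[Int] =
    xs.sorted(Ordering[Int].reverse).take(K)

  def run(blocks: Seq[Seq[Int]]): Seq[Int] = {
    val partials = blocks.map(topK)  // map phase: at most K records per block
    topK(partials.flatten)           // reduce phase: merge the summaries
  }
}
```

Because the top-K merge is associative and commutative, the result equals the top K of all the data, while the shuffle carries only the partial summaries.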
37. Improving performance
Our second performance issue: what about performance lost due to an inefficient ordering of the map and reduce steps?
38. Top elements problem revisited
New problem definition: output the top percentage p of elements instead of a fixed K top elements.
What is K? K = p * count
39. Top %p of elements algorithm flow
What is K? K = p * count
Read input records → … → Divide into groups by site section → (per group) Count the number of records, then sort the records and take the top p → Output top records
40. Top %p of elements Scalding job

class TopPJob(args : Args) extends Job(args) {
  // visitScheme after join with exclusion list
  val visits : RichPipe = …
  val counts = visits
    .groupBy('section){ _.size('sectionSize) }
    .map('sectionSize -> 'sectionK){ size : Int => (size * p).toInt }
  // taking top %p of elements
  visits.joinWithTiny('section -> 'section, counts)
  …
}
41. Flow graph
How will this flow be executed on Hadoop?
• How many MapReduce steps will be performed?
• What will be the input to each step?
• What logic will each step contain?

42. Flow graph
To find out, run the job with --tool.graph!
44. Flow graph
The full flow in Cascading terminology:
• Reading input, join with exclusion list
• Split to counting
• Counting and calculating K
• Join with counting result
• Joining with K and sorting
46. Flow graph
And another graph, showing the MapReduce steps:
• First step: records input and exclusion list (sources) → group
• Second step: records input and exclusion list (sources) → group → output file (sink)
Note that each step reads the exclusion list, so the join with it is performed twice.
47. Flow graph
Changing the join with the exclusion list to be performed only once – only a single line is added:

val visits : RichPipe =
  …
  .project(visitScheme)
  .forceToDisk

val counts = visits
  .groupBy('section){ _.size('sectionSize) }
  …

visits.joinWithTiny('section -> 'section, counts)
…
48. Flow graph
The new MapReduce steps:
• First step: records input and exclusion list (sources) → group
• Second step: group
• Third step: group → output file (sink)
49. Improving performance
We saw how:
• Writing Scalding jobs is simple, intuitive and fast.
• We can use external resources to improve the performance of our algorithms; Scalding performs some of this work implicitly for us.
• We can use the Cascading library that Scalding is built on to understand the exact steps that will run.
50. Additional features
Some other features in Scalding:
• Typed API

TypedTsv[visitType](args("input"))      // TypedPipe[visitType]
  .filter(_._2 == "Israel")             // TypedPipe[visitType]
  .toPipe(visitScheme)
  .toTypedPipe[visitType](visitScheme)

• Testing using JobTest – give the input and get the output as Lists
• Matrix API – useful for running graph algorithms such as PageRank
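As an example of JobTest, a test for the top-K job from slide 8 might look roughly like this (a sketch assuming a job class named TopKJob as in the earlier slides; it needs the Scalding test dependencies on the classpath and is not runnable standalone):

```scala
import com.twitter.scalding._

// Sketch: drive the job with an in-memory source and inspect the sink.
// The exact expected shape of the sink rows depends on the job's schema.
JobTest(classOf[TopKJob].getName)
  .arg("input", "inputFile")
  .arg("output", "outputFile")
  .source(Tsv("inputFile", 'item), List(13, 55, 8, 2, 34, 89, 21, 8))
  .sink[Int](Tsv("outputFile", 'top)) { buffer =>
    // assert on the in-memory output List here
  }
  .run
  .finish
```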
51. Scalding in LivePerson
How do we use Scalding in LivePerson?
• It is the main tool in the Data Science team
• Used both for quick data exploration and in production jobs