Efficient MapReduce using Scalding
Neta Barkay | Data Scientist, LivePerson | December
Outline

Scalding: a Scala library that makes it easy to write MapReduce jobs in Hadoop.

We will talk about:
• The MapReduce paradigm
• Writing Scalding jobs
• Improving job performance
• Typed API, testing
Getting a glimpse of some Scalding code

class TopKJob(args : Args) extends Job(args){
  // K is not shown in the slide text; reading it from a "k" argument is an assumption
  val k = args("k").toInt
  val exclusions = Tsv(args("exclusions"), 'exVisitorId)
  Tsv(args("input"), visitScheme)
    .filter('country){country : String => country == "Israel"}
    .leftJoinWithTiny('visitorId -> 'exVisitorId, exclusions)
    // keep only visits that found no match in the exclusion list
    .filter('exVisitorId){isEx : String => isEx == null}
    .groupBy('section){_.sortWithTake(visitScheme -> 'top, k)(biggerSale)}
    .flattenTo[visitType]{'top -> visitScheme}
    .write(Tsv(args("output"), visitScheme))
}
Asking big data questions

Which questions will you ask?
What analysis will you do?
A possible approach:

Use the outliers to improve your product
• Most popular products on your site
• Visits that ended with the highest sale value

That is the problem of finding the top elements in the data.
Data analysis problem

Top elements problem
Input
• Data: arranged in records
• K: number of top elements, or p: percentage of top elements to output
• Order function: some ordering on the records

Output
• The K top records of our data, or the top p percent, according to the order function
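The problem statement above can be sketched in plain Scala collections, without Hadoop or Scalding. This is only an in-memory illustration; the names `TopElements`, `topK` and `topP` are made up for this sketch, and rounding K = p * count down is an assumption.

```scala
object TopElements {
  // Top K records according to `ord`, largest first.
  def topK[T](records: Seq[T], k: Int)(implicit ord: Ordering[T]): Seq[T] =
    records.sorted(ord.reverse).take(k)

  // Top fraction p (between 0 and 1) of the records.
  // Assumption: K = p * count, rounded down.
  def topP[T](records: Seq[T], p: Double)(implicit ord: Ordering[T]): Seq[T] =
    topK(records, (p * records.size).toInt)
}
```

For example, `TopElements.topK(Seq(13, 55, 8, 2, 34, 89, 21, 8), 5)` keeps the five largest values in descending order.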
Algorithm flow

Top K elements problem
• Read input records
  Input = 13, 55, 8, 2, 34, 89, 21, 8
  K = 5
• Sort records, take top K
• Output top records
  Output = 89, 55, 34, 21, 13
Scalding code
Tsv(args("input"), 'item)
  // K = 5, as in the example input
  .groupAll{_.sortWithTake('item -> 'top, 5){(a : Int, b : Int) => a > b}}
  .write(Tsv(args("output"), 'top))
Algorithm flow

Top K elements problem
1. Read input records
2. Read exclusion list from external source
3. Filter records that fit target population
4. Filter out the visits from the exclusion list according to visitor id
5. Divide to groups by site section
6. For each group: sort records, take top K
7. For each group: output top records
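The whole flow can be sketched on in-memory collections before moving to Hadoop. This is a minimal sketch, not the Scalding job itself: the `Visit` case class, the `TopKPerSection` object and its parameter names are hypothetical and only mirror the fields used in the talk.

```scala
object TopKPerSection {
  // Hypothetical record type mirroring the visit fields used in the talk.
  case class Visit(visitorId: String, country: String, section: String, saleValue: Double)

  def topKPerSection(
      visits: Seq[Visit],
      exclusions: Set[String],
      country: String,
      k: Int): Map[String, Seq[Visit]] =
    visits
      .filter(_.country == country)             // filter records that fit the target population
      .filterNot(v => exclusions(v.visitorId))  // filter out visits from the exclusion list
      .groupBy(_.section)                       // divide to groups by site section
      .map { case (section, group) =>
        // sort each group by sale value, largest first, and take the top K
        section -> group.sortWith(_.saleValue > _.saleValue).take(k)
      }
}
```

Each step of the chain corresponds to one step of the flow, which is also how the Scalding pipeline is read.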
MapReduce on Hadoop

[Diagram: HDFS blocks 1…n are each read by a mapper; the mapper output is sent to the reducers, and each reducer writes an output file. The transfer from the mappers to the reducers is marked "Big bottleneck".]
Mapper: (k, v) → (k'1, v'1), (k'2, v'2)…
Reducer: (k', iterator(v')) → v''1, v''2…
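The mapper and reducer signatures above can be imitated on in-memory collections. The sketch below is not Hadoop code; `MiniMapReduce` and `run` are hypothetical names, and the "shuffle" is a plain `groupBy`, which stands in for the network transfer that forms the bottleneck.

```scala
object MiniMapReduce {
  // Runs map -> shuffle (group by key) -> reduce over in-memory "blocks".
  def run[A, K, V, R](blocks: Seq[Seq[A]])(mapper: A => Seq[(K, V)])(reducer: (K, Seq[V]) => R): Map[K, R] = {
    // Map phase: every record of every block emits zero or more (key, value) pairs.
    val mapped: Seq[(K, V)] = blocks.flatMap(block => block.flatMap(mapper))
    // Shuffle: all values of one key are brought together.
    val shuffled: Map[K, Seq[V]] =
      mapped.groupBy(_._1).map { case (key, pairs) => key -> pairs.map(_._2) }
    // Reduce phase: one call per key with all of its values.
    shuffled.map { case (key, values) => key -> reducer(key, values) }
  }
}
```

A word count, for instance, would be `MiniMapReduce.run(blocks)(w => Seq((w, 1)))((_, ones) => ones.sum)`. Every pair emitted by a mapper crosses the shuffle, which is why reducing the mapper output matters.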
Efficient MapReduce

Which tool should we use?
• Efficient execution: has built-in performance-oriented features
• Full functionality
• Fast code writing: easy to alter and easy to maintain
About Scalding

"Scalding is a Scala library that makes it easy to write MapReduce jobs in Hadoop. It's similar to other MapReduce platforms like Pig and Hive, but offers a higher level of abstraction by leveraging the full power of Scala and the JVM."
–Twitter
Simple Scalding job

import com.twitter.scalding._

val visitScheme = ('visitorId, 'country, 'section, 'saleValue)
type visitType = (String, String, String, Double)
// true when v1 has the larger sale value
def biggerSale(v1 : visitType, v2 : visitType) = v1._4 > v2._4

class TopKJob(args : Args) extends Job(args){
  // K is not shown in the slide text; reading it from a "k" argument is an assumption
  val k = args("k").toInt
  Tsv(args("input"), visitScheme)
    .filter('country){country : String => country == "Israel"}
    .groupBy('section){_.sortWithTake(visitScheme -> 'top, k)(biggerSale)}
    .flattenTo[visitType]{'top -> visitScheme}
    .write(Tsv(args("output"), visitScheme))
}
MapReduce joins

We would like to filter out the visits that appear in the exclusion list:

Visits:
visitorId   country   section   saleValue
1           Israel    …         …
2           Israel    …         …
3           Israel    …         …

Exclusion list:
exVisitorId
3
1

After a left join of the visits with the exclusion list:
visitorId   country   section   saleValue   exVisitorId
1           Israel    …         …           1
2           Israel    …         …           null
3           Israel    …         …           3

After keeping only the rows where exVisitorId is null:
visitorId   country   section   saleValue   exVisitorId
2           Israel    …         …           null
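The join-then-filter idea in the tables can be sketched in plain Scala, with `None` playing the role of null. This is an illustration only: `ExclusionJoin`, `leftJoin` and `notExcluded` are hypothetical names, and holding the small exclusion list in an in-memory `Set` is an assumption about why a "tiny" join is cheap.

```scala
object ExclusionJoin {
  // Hypothetical record type mirroring the visit fields used in the talk.
  case class Visit(visitorId: String, country: String, section: String, saleValue: Double)

  // Left join: every visit is kept; the right side is None when there is no match.
  def leftJoin(visits: Seq[Visit], exclusions: Set[String]): Seq[(Visit, Option[String])] =
    visits.map(v => (v, if (exclusions(v.visitorId)) Some(v.visitorId) else None))

  // Keep only the visits whose joined exVisitorId is missing.
  def notExcluded(visits: Seq[Visit], exclusions: Set[String]): Seq[Visit] =
    leftJoin(visits, exclusions).collect { case (v, None) => v }
}
```

With visitors 1, 2 and 3 and the exclusion list {3, 1}, only visitor 2 survives, as in the last table.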
Simple Scalding job

Filtering using leftJoinWithTiny:
class TopKJob(args : Args) extends Job(args){
  // K is not shown in the slide text; reading it from a "k" argument is an assumption
  val k = args("k").toInt
  val exclusions = Tsv(args("exclusions"), 'exVisitorId)
  Tsv(args("input"), visitScheme)
    .filter('country){country : String => country == "Israel"}
    .leftJoinWithTiny('visitorId -> 'exVisitorId, exclusions)
    // visits without a match in the exclusion list have a null exVisitorId
    .filter('exVisitorId){isEx : String => isEx == null}
    .groupBy('section){_.sortWithTake(visitScheme -> 'top, k)(biggerSale)}
    .flattenTo[visitType]{'top -> visitScheme}
    .write(Tsv(args("output"), visitScheme))
}
Efficient MapReduce

Functionality is complete. What's next?
• Efficient execution
• Full functionality
• Fast code writing
Efficient MapReduce

MapReduce performance issues:

1. Traffic bottleneck between the mappers and the reducers.
2. Inefficient order of map and reduce steps.
Efficient MapReduce

1. Traffic bottleneck between the mappers and the reducers.

The traffic bottleneck occurs when we take the top K elements.
• We would like each mapper to output only the top elements of its input.
• How is sortWithTake implemented?
Efficient performance using Algebird

sortWithTake uses:
class PriorityQueueMonoid[T](max : Int)(implicit
ord : Ordering[T]) extends Monoid[PriorityQueue[T]]

Defined in:
Algebird (Twitter): Abstract algebra for Scala, targeted at
building aggregation systems.
PriorityQueue case:
• The identity is the empty PriorityQueue
• Two PriorityQueues can be added, for example with K=5:
  Q1: values = 55, 34, 21, 13, 8
  Q2: values = 100, 80, 60, 40, 20
  Q1 plus Q2: values = 100, 80, 60, 55, 40
• The addition is associative and commutative
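The PriorityQueue addition can be imitated in a few lines of plain Scala. This is not Algebird's implementation; it is a sketch that models a bounded queue as a descending list of at most K values, and the names `TopKMonoid`, `build` and `plus` are made up for the illustration.

```scala
object TopKMonoid {
  // The identity element: an empty "queue".
  def zero: List[Int] = Nil

  // Keep at most k values, largest first.
  def build(k: Int)(values: Seq[Int]): List[Int] =
    values.sorted(Ordering[Int].reverse).take(k).toList

  // Adding two queues keeps the k largest values of both.
  def plus(k: Int)(q1: List[Int], q2: List[Int]): List[Int] =
    build(k)(q1 ++ q2)
}
```

Because `plus` is associative and commutative, each mapper can build the top K of its own block and the reducer only has to add the partial results, which yields the same answer as sorting everything in one place.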
Efficient performance using Algebird

All Monoid aggregations can start in Map phase, then
finish in Reduce phase. This decreases the amount
of traffic from the mappers to the reducers.
Performed implicitly when using Scalding built-in
aggregation functions:
average
sum
sizeAveStdev
histogram
approximateUniqueCount
sortWithTake
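The effect of starting an aggregation in the map phase can be sketched with an average. The code below is an in-memory illustration under assumptions, not Scalding's implementation: `MapSideAverage`, `mapSide` and `reduceSide` are hypothetical names, and the partial aggregate is modelled as a (sum, count) pair.

```scala
object MapSideAverage {
  // Partial aggregate for an average: (sum, count).
  type Partial = (Double, Long)
  val zero: Partial = (0.0, 0L)

  // Adding partials is associative and commutative.
  def plus(a: Partial, b: Partial): Partial = (a._1 + b._1, a._2 + b._2)

  // "Map phase": a whole block collapses into a single partial.
  def mapSide(block: Seq[Double]): Partial =
    block.foldLeft(zero)((acc, x) => plus(acc, (x, 1L)))

  // "Reduce phase": merge the partials and finish the computation.
  // Note: an empty input gives 0.0 / 0, which is NaN.
  def reduceSide(partials: Seq[Partial]): Double = {
    val (sum, count) = partials.foldLeft(zero)(plus)
    sum / count
  }
}
```

Only one pair per block travels from the map side to the reduce side, instead of one value per record.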
Improving performance

Our second performance issue:

What about the performance due to
inefficient order of the map and reduce
steps?
Top elements problem revisited

New problem definition:

Output the percentage p of top elements
instead of the fixed K top elements.

What is K?
K = p * count
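A small in-memory sketch of the revised problem: count the records of each group first, derive K from the count, then take the top K. The names `TopPercent`, `kFor` and `topPPerGroup` are hypothetical, and truncating K to an integer is an assumption.

```scala
object TopPercent {
  // K = p * count, truncated to an integer (assumption).
  def kFor(p: Double, count: Int): Int = (p * count).toInt

  // sales: (section, saleValue) pairs; p: fraction of top elements to keep per section.
  def topPPerGroup(sales: Seq[(String, Double)], p: Double): Map[String, Seq[Double]] =
    sales.groupBy(_._1).map { case (section, rows) =>
      val values = rows.map(_._2)
      // The count must be known before the top elements can be taken.
      section -> values.sortWith(_ > _).take(kFor(p, values.size))
    }
}
```

Unlike the fixed-K case, the count is needed before sorting, which is what adds an extra pass over the data.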
Top p% of elements algorithm flow

What is K?
K = p * count

• Read input records
• …
• Divide to groups by site section
• For each group: count the number of records
• For each group: sort records, take top p
• For each group: output top records
Top p% of elements Scalding job
class TopPJob(args : Args) extends Job(args){
  // the value of p is missing from the slide text; reading it from a "p" argument is an assumption
  val p = args("p").toDouble
  // visitScheme after join with exclusion list
  val visits : RichPipe = …
  val counts = visits
    .groupBy('section){_.size('sectionSize)}
    .map('sectionSize -> 'sectionK){size : Int => (size * p).toInt}
  // taking top %p of elements
  visits.joinWithTiny('section -> 'section, counts)
  …
}
Flow graph

How will this flow be executed on Hadoop?
• How many MapReduce steps will be performed?
• What will be the input to each step?
• What logic will each contain?

Run with --tool.graph!
Flow graph

[Figure: the full flow in Cascading terminology. Annotated stages: reading input and join with exclusion list; split to counting; counting and calculating K; join with counting result; joining with K and sorting.]
Flow graph

And another graph:

[Figure: the flow as MapReduce steps. First step: sources (records input, exclusion list) and a group. Second step: sources (records input, exclusion list), a group, and the sink (output file).]
Flow graph

Changing the join with the exclusion list to be performed only once:
val visits : RichPipe =
  …
  .project(visitScheme)
  .forceToDisk // only a single line is added!

val counts = visits
  .groupBy('section){_.size('sectionSize)}
…
visits.joinWithTiny('section -> 'section, counts)
…
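Why materializing the shared pipe helps can be shown with a plain Scala analogy. This is not Scalding or Cascading code; `ReuseIntermediate`, `joinWithExclusions` and the sample rows are made up, and the counter only illustrates how often the upstream work runs when two branches consume it.

```scala
object ReuseIntermediate {
  // Counts how often the "expensive" upstream step executes.
  var joinRuns = 0

  // Stand-in for reading the input and joining it with the exclusion list.
  def joinWithExclusions(): Seq[(String, Double)] = {
    joinRuns += 1
    Seq(("news", 10.0), ("news", 40.0), ("sport", 5.0))
  }

  // Each branch re-runs the upstream step.
  def recomputed(): (Map[String, Int], Seq[(String, Double)]) = {
    val counts = joinWithExclusions().groupBy(_._1).map { case (s, rows) => s -> rows.size }
    (counts, joinWithExclusions())
  }

  // The upstream result is materialized once and shared by both branches.
  def materialized(): (Map[String, Int], Seq[(String, Double)]) = {
    val visits = joinWithExclusions()
    val counts = visits.groupBy(_._1).map { case (s, rows) => s -> rows.size }
    (counts, visits)
  }
}
```

Both variants return the same result; the difference is that the second one pays for the join once, at the cost of storing the intermediate data.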
Flow graph

The new MapReduce steps:

[Figure: the flow as three MapReduce steps. First step: sources (records input, exclusion list). Second step: group. Third step: group and the sink (output file).]
Improving performance

We saw how:
• Writing Scalding jobs is simple, intuitive and fast.
• We can use external resources to improve the performance of our algorithms. Scalding does some of this work implicitly for us.
• We can use Cascading, the library Scalding is built on, to understand exactly which steps will run.
Additional features

Some other features in Scalding
• Typed API
TypedTsv[visitType](args("input"))
  .filter(_._2 == "Israel")              // TypedPipe[visitType]
  .toPipe(visitScheme)
  .toTypedPipe[visitType](visitScheme)   // TypedPipe[visitType]
• Testing using JobTest
  Give the input and get the output as Lists
• Matrix API
  Useful for running graph algorithms such as PageRank
Scalding in LivePerson

How do we use
Scalding in LivePerson?

• The main tool in the Data Science team
• Both for quick data exploration, and in production jobs
LivePerson Developers

developer.liveperson.com
apps.liveperson.com

YouTube.com/LivePersonDev
Twitter.com/LivePersonDev
Facebook.com/LivePersonDev
Thank You!
Contact info:
netab@liveperson.com
netabarkay@gmail.com

We are hiring!
Scalding: Reaching Efficient MapReduce

Kubernetes your tests! automation with docker on google cloud platformLivePerson
 
Growing into a proactive Data Platform
Growing into a proactive Data PlatformGrowing into a proactive Data Platform
Growing into a proactive Data PlatformLivePerson
 
Measure() or die()
Measure() or die() Measure() or die()
Measure() or die() LivePerson
 
Resilience from Theory to Practice
Resilience from Theory to PracticeResilience from Theory to Practice
Resilience from Theory to PracticeLivePerson
 
System Revolution- How We Did It
System Revolution- How We Did It System Revolution- How We Did It
System Revolution- How We Did It LivePerson
 
Liveperson DLD 2015
Liveperson DLD 2015 Liveperson DLD 2015
Liveperson DLD 2015 LivePerson
 
Http 2: Should I care?
Http 2: Should I care?Http 2: Should I care?
Http 2: Should I care?LivePerson
 
Mobile app real-time content modifications using websockets
Mobile app real-time content modifications using websocketsMobile app real-time content modifications using websockets
Mobile app real-time content modifications using websocketsLivePerson
 
Mobile SDK: Considerations & Best Practices
Mobile SDK: Considerations & Best Practices Mobile SDK: Considerations & Best Practices
Mobile SDK: Considerations & Best Practices LivePerson
 
Functional programming with Java 8
Functional programming with Java 8Functional programming with Java 8
Functional programming with Java 8LivePerson
 
Apache Avro in LivePerson [Hebrew]
Apache Avro in LivePerson [Hebrew]Apache Avro in LivePerson [Hebrew]
Apache Avro in LivePerson [Hebrew]LivePerson
 
Apache Avro and Messaging at Scale in LivePerson
Apache Avro and Messaging at Scale in LivePersonApache Avro and Messaging at Scale in LivePerson
Apache Avro and Messaging at Scale in LivePersonLivePerson
 
Data compression in Modern Application
Data compression in Modern ApplicationData compression in Modern Application
Data compression in Modern ApplicationLivePerson
 
Support Office Hour Webinar - LivePerson API
Support Office Hour Webinar - LivePerson API Support Office Hour Webinar - LivePerson API
Support Office Hour Webinar - LivePerson API LivePerson
 
SIP - Introduction to SIP Protocol
SIP - Introduction to SIP ProtocolSIP - Introduction to SIP Protocol
SIP - Introduction to SIP ProtocolLivePerson
 
Building Enterprise Level End-To-End Monitor System with Open Source Solution...
Building Enterprise Level End-To-End Monitor System with Open Source Solution...Building Enterprise Level End-To-End Monitor System with Open Source Solution...
Building Enterprise Level End-To-End Monitor System with Open Source Solution...LivePerson
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data ScienceLivePerson
 
From a Kafkaesque Story to The Promised Land at LivePerson
From a Kafkaesque Story to The Promised Land at LivePersonFrom a Kafkaesque Story to The Promised Land at LivePerson
From a Kafkaesque Story to The Promised Land at LivePersonLivePerson
 

Mehr von LivePerson (20)

Microservices on top of kafka
Microservices on top of kafkaMicroservices on top of kafka
Microservices on top of kafka
 
Graph QL Introduction
Graph QL IntroductionGraph QL Introduction
Graph QL Introduction
 
Kubernetes your tests! automation with docker on google cloud platform
Kubernetes your tests! automation with docker on google cloud platformKubernetes your tests! automation with docker on google cloud platform
Kubernetes your tests! automation with docker on google cloud platform
 
Growing into a proactive Data Platform
Growing into a proactive Data PlatformGrowing into a proactive Data Platform
Growing into a proactive Data Platform
 
Measure() or die()
Measure() or die() Measure() or die()
Measure() or die()
 
Resilience from Theory to Practice
Resilience from Theory to PracticeResilience from Theory to Practice
Resilience from Theory to Practice
 
System Revolution- How We Did It
System Revolution- How We Did It System Revolution- How We Did It
System Revolution- How We Did It
 
Liveperson DLD 2015
Liveperson DLD 2015 Liveperson DLD 2015
Liveperson DLD 2015
 
Http 2: Should I care?
Http 2: Should I care?Http 2: Should I care?
Http 2: Should I care?
 
Mobile app real-time content modifications using websockets
Mobile app real-time content modifications using websocketsMobile app real-time content modifications using websockets
Mobile app real-time content modifications using websockets
 
Mobile SDK: Considerations & Best Practices
Mobile SDK: Considerations & Best Practices Mobile SDK: Considerations & Best Practices
Mobile SDK: Considerations & Best Practices
 
Functional programming with Java 8
Functional programming with Java 8Functional programming with Java 8
Functional programming with Java 8
 
Apache Avro in LivePerson [Hebrew]
Apache Avro in LivePerson [Hebrew]Apache Avro in LivePerson [Hebrew]
Apache Avro in LivePerson [Hebrew]
 
Apache Avro and Messaging at Scale in LivePerson
Apache Avro and Messaging at Scale in LivePersonApache Avro and Messaging at Scale in LivePerson
Apache Avro and Messaging at Scale in LivePerson
 
Data compression in Modern Application
Data compression in Modern ApplicationData compression in Modern Application
Data compression in Modern Application
 
Support Office Hour Webinar - LivePerson API
Support Office Hour Webinar - LivePerson API Support Office Hour Webinar - LivePerson API
Support Office Hour Webinar - LivePerson API
 
SIP - Introduction to SIP Protocol
SIP - Introduction to SIP ProtocolSIP - Introduction to SIP Protocol
SIP - Introduction to SIP Protocol
 
Building Enterprise Level End-To-End Monitor System with Open Source Solution...
Building Enterprise Level End-To-End Monitor System with Open Source Solution...Building Enterprise Level End-To-End Monitor System with Open Source Solution...
Building Enterprise Level End-To-End Monitor System with Open Source Solution...
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
From a Kafkaesque Story to The Promised Land at LivePerson
From a Kafkaesque Story to The Promised Land at LivePersonFrom a Kafkaesque Story to The Promised Land at LivePerson
From a Kafkaesque Story to The Promised Land at LivePerson
 

Kürzlich hochgeladen

Measures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDMeasures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDThiyagu K
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeThiyagu K
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactdawncurless
 
mini mental status format.docx
mini    mental       status     format.docxmini    mental       status     format.docx
mini mental status format.docxPoojaSen20
 
Mastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionMastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionSafetyChain Software
 
Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..Disha Kariya
 
Arihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfArihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfchloefrazer622
 
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxPOINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxSayali Powar
 
Separation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and ActinidesSeparation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and ActinidesFatimaKhan178732
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactPECB
 
The byproduct of sericulture in different industries.pptx
The byproduct of sericulture in different industries.pptxThe byproduct of sericulture in different industries.pptx
The byproduct of sericulture in different industries.pptxShobhayan Kirtania
 
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...Sapna Thakur
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdfSoniaTolstoy
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdfQucHHunhnh
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdfQucHHunhnh
 
Disha NEET Physics Guide for classes 11 and 12.pdf
Disha NEET Physics Guide for classes 11 and 12.pdfDisha NEET Physics Guide for classes 11 and 12.pdf
Disha NEET Physics Guide for classes 11 and 12.pdfchloefrazer622
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingTechSoup
 
JAPAN: ORGANISATION OF PMDA, PHARMACEUTICAL LAWS & REGULATIONS, TYPES OF REGI...
JAPAN: ORGANISATION OF PMDA, PHARMACEUTICAL LAWS & REGULATIONS, TYPES OF REGI...JAPAN: ORGANISATION OF PMDA, PHARMACEUTICAL LAWS & REGULATIONS, TYPES OF REGI...
JAPAN: ORGANISATION OF PMDA, PHARMACEUTICAL LAWS & REGULATIONS, TYPES OF REGI...anjaliyadav012327
 

Kürzlich hochgeladen (20)

Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
 
Measures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDMeasures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SD
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and Mode
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impact
 
mini mental status format.docx
mini    mental       status     format.docxmini    mental       status     format.docx
mini mental status format.docx
 
Mastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionMastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory Inspection
 
Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..
 
Arihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfArihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdf
 
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxPOINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
 
Separation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and ActinidesSeparation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and Actinides
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global Impact
 
The byproduct of sericulture in different industries.pptx
The byproduct of sericulture in different industries.pptxThe byproduct of sericulture in different industries.pptx
The byproduct of sericulture in different industries.pptx
 
Advance Mobile Application Development class 07
Advance Mobile Application Development class 07Advance Mobile Application Development class 07
Advance Mobile Application Development class 07
 
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdf
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdf
 
Disha NEET Physics Guide for classes 11 and 12.pdf
Disha NEET Physics Guide for classes 11 and 12.pdfDisha NEET Physics Guide for classes 11 and 12.pdf
Disha NEET Physics Guide for classes 11 and 12.pdf
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy Consulting
 
JAPAN: ORGANISATION OF PMDA, PHARMACEUTICAL LAWS & REGULATIONS, TYPES OF REGI...
JAPAN: ORGANISATION OF PMDA, PHARMACEUTICAL LAWS & REGULATIONS, TYPES OF REGI...JAPAN: ORGANISATION OF PMDA, PHARMACEUTICAL LAWS & REGULATIONS, TYPES OF REGI...
JAPAN: ORGANISATION OF PMDA, PHARMACEUTICAL LAWS & REGULATIONS, TYPES OF REGI...
 

Scalding: Reaching Efficient MapReduce

  • 1. Efficient MapReduce using Scalding Neta Barkay | Data Scientist, LivePerson | December
  • 2. Outline Scalding is a Scala library that makes it easy to write MapReduce jobs in Hadoop. We will talk about: • The MapReduce paradigm • Writing Scalding jobs • Improving job performance • The typed API and testing
  • 3. Getting a glimpse of some Scalding code
      class TopKJob(args : Args) extends Job(args) {
        val exclusions = Tsv(args("exclusions"), 'exVisitorId)
        Tsv(args("input"), visitScheme)
          .filter('country){country : String => country == "Israel"}
          .leftJoinWithTiny('visitorId -> 'exVisitorId, exclusions)
          .filter('exVisitorId){isEx : String => isEx == null}
          .groupBy('section){_.sortWithTake(visitScheme -> 'top, K)(biggerSale)}
          .flattenTo[visitType]{'top -> visitScheme}
          .write(Tsv(args("output"), visitScheme))
      }
  • 5. Asking big data questions Which questions will you ask? What analysis will you do? A possible approach: Use the outliers to improve your product • Most popular products on your site • Visits that ended with the highest sale value That is the problem of finding the top elements in the data.
  • 6. Data analysis problem Top elements problem Input • Data – arranged in records • K – number of top elements or p – percentage of top elements to output • Order function – some ordering on the records Output • K top records of our data or top p percentage according to the order function
  • 8. Algorithm flow (Top K elements problem)
      Read input records:        Input = 13, 55, 8, 2, 34, 89, 21, 8; K = 5
      Sort records, take top K
      Output top records:        Output = 89, 55, 34, 21, 13
    Scalding code:
      Tsv(args("input"), 'item)
        .groupAll{_.sortWithTake('item -> 'top, 5){(a : Int, b : Int) => a > b}}
        .write(Tsv(args("output"), 'top))
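The same flow can be tried outside Hadoop with plain Scala collections. This is only a minimal sketch of the "sort records, take top K" step, not Scalding code; the object name `TopK` and the method `topK` are made up for illustration.

```scala
object TopK {
  // Returns the k largest elements in descending order.
  // Sorting the whole input is the simplest approach, not the cheapest one.
  def topK(items: Seq[Int], k: Int): Seq[Int] =
    items.sortWith((a, b) => a > b).take(k)
}
```

Calling `TopK.topK(Seq(13, 55, 8, 2, 34, 89, 21, 8), 5)` reproduces the output listed on the slide.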
  • 9. Algorithm flow (Top K elements problem): Read input records → Sort records, take top K → Output top records
  • 10. Algorithm flow (Top K elements problem): Read input records → Filter records that fit target population → Sort records, take top K → Output top records
  • 11. Algorithm flow (Top K elements problem): Read input records → Filter records that fit target population → Divide to groups by site section → for each group: Sort records, take top K → Output top records
  • 12. Algorithm flow (Top K elements problem): Read input records; Read exclusion list from external source → Filter records that fit target population → Filter out the visits from the exclusion list according to visitor id → Divide to groups by site section → for each group: Sort records, take top K → Output top records
  • 13. MapReduce on Hadoop: HDFS blocks 1..n → one Mapper per block: (k, v) → (k'1, v'1), (k'2, v'2), … → Reducers: (k', iterator(v')) → v''1, v''2, … → Output files
  • 14. MapReduce on Hadoop: the same diagram, with a "Big bottleneck" label
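The mapper and reducer signatures above can be mimicked in memory with plain Scala. The sketch below is a toy model under stated assumptions: blocks are ordinary sequences, the shuffle is a `groupBy`, and the names `MiniMapReduce` and `run` are hypothetical. It is not how Hadoop is implemented.

```scala
object MiniMapReduce {
  // Runs map, shuffle (group by key) and reduce over in-memory "blocks".
  def run[A, K, V, R](blocks: Seq[Seq[A]])(mapper: A => Seq[(K, V)])(
      reducer: (K, Seq[V]) => R): Map[K, R] = {
    // Map phase: every block is processed independently of the others.
    val mapped: Seq[(K, V)] = blocks.flatMap(block => block.flatMap(mapper))
    // Shuffle: every (key, value) pair is routed to the group of its key.
    val shuffled: Map[K, Seq[V]] =
      mapped.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2) }
    // Reduce phase: one reducer call per key.
    shuffled.map { case (k, vs) => k -> reducer(k, vs) }
  }
}
```

For example, a word count over two blocks is `MiniMapReduce.run(Seq(Seq("a", "b"), Seq("a")))(w => Seq(w -> 1))((_, vs) => vs.sum)`. In this model every mapped pair passes through the shuffle, which is the traffic that a real cluster has to move between machines.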
  • 15. Efficient MapReduce Which tool should we use? One that has built-in performance-oriented features and is easy to alter and easy to maintain. Goals: Efficient Execution, Full Functionality, Fast Code Writing
  • 16. About Scalding Scalding is a Scala library that makes it easy to write MapReduce jobs in Hadoop. It's similar to other MapReduce platforms like Pig and Hive, but offers a higher level of abstraction by leveraging the full power of Scala and the JVM –Twitter
  • 18. Simple Scalding job
      val visitScheme = ('visitorId, 'country, 'section, 'saleValue)
      type visitType = (String, String, String, Double)
      def biggerSale(v1 : visitType, v2 : visitType) = v1._4 > v2._4
  • 19. Simple Scalding job
      import com.twitter.scalding._
      val visitScheme = ('visitorId, 'country, 'section, 'saleValue)
      type visitType = (String, String, String, Double)
      def biggerSale(v1 : visitType, v2 : visitType) = v1._4 > v2._4
      class TopKJob(args : Args) extends Job(args) {
        Tsv(args("input"), visitScheme)
          .filter('country){country : String => country == "Israel"}
          .groupBy('section){_.sortWithTake(visitScheme -> 'top, K)(biggerSale)}
          .flattenTo[visitType]{'top -> visitScheme}
          .write(Tsv(args("output"), visitScheme))
      }
  • 24-27. MapReduce joins
    We would like to filter out the visits that appear in the exclusion list.
    Visits:
      visitorId  country  section  saleValue
      1          Israel   …        …
      2          Israel   …        …
      3          Israel   …        …
    Exclusion list:
      exVisitorId
      3
      1
    After a left join on visitorId = exVisitorId:
      visitorId  country  section  saleValue  exVisitorId
      1          Israel   …        …          1
      2          Israel   …        …          null
      3          Israel   …        …          3
    After keeping only the rows where exVisitorId is null:
      visitorId  country  section  saleValue  exVisitorId
      2          Israel   …        …          null
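The join-then-filter idea can be sketched with plain Scala collections. This is an illustration only: `Option` stands in for the null that the left join produces, the exclusion list is assumed small enough to hold in a `Set`, and the names `ExclusionJoin`, `leftJoin` and `notExcluded` are hypothetical.

```scala
object ExclusionJoin {
  // (visitorId, country, section, saleValue), following the slides' visitType
  type Visit = (String, String, String, Double)

  // Left join: keep every visit, paired with the matching exclusion id, if any.
  def leftJoin(visits: Seq[Visit], exclusions: Set[String]): Seq[(Visit, Option[String])] =
    visits.map(v => (v, Some(v._1).filter(exclusions.contains)))

  // Keep only the visits for which the join found no match.
  def notExcluded(visits: Seq[Visit], exclusions: Set[String]): Seq[Visit] =
    leftJoin(visits, exclusions).collect { case (v, None) => v }
}
```

With three visits by visitors "1", "2" and "3" (section and sale values made up) and the exclusion set `Set("3", "1")`, `notExcluded` keeps only the visit of visitor "2", matching the last table above.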
  • 28. Simple Scalding job
    Filtering using leftJoinWithTiny:
      class TopKJob(args : Args) extends Job(args) {
        val exclusions = Tsv(args("exclusions"), 'exVisitorId)
        Tsv(args("input"), visitScheme)
          .filter('country){country : String => country == "Israel"}
          .leftJoinWithTiny('visitorId -> 'exVisitorId, exclusions)
          .filter('exVisitorId){isEx : String => isEx == null}
          .groupBy('section){_.sortWithTake(visitScheme -> 'top, K)(biggerSale)}
          .flattenTo[visitType]{'top -> visitScheme}
          .write(Tsv(args("output"), visitScheme))
      }
  • 30. Simple Scalding job Functionality is complete. What's next?
  • 31. Efficient MapReduce Functionality is complete. What's next? Goals: Efficient Execution, Full Functionality, Fast Code Writing
  • 32. Efficient MapReduce MapReduce performance issues: 1. Traffic bottleneck between the mappers and the reducers. 2. Inefficient order of map and reduce steps.
  • 33. Efficient MapReduce MapReduce performance issues: 1. Traffic bottleneck between the mappers and the reducers. In our job, the traffic bottleneck occurs when we take the top K elements. • We would like each mapper to output only the top elements of its input. • How is sortWithTake implemented?
  • 34. Efficient performance using Algebird sortWithTake uses: class PriorityQueueMonoid[T](max : Int)(implicit ord : Ordering[T]) extends Monoid[PriorityQueue[T]] Defined in Algebird (Twitter): abstract algebra for Scala, targeted at building aggregation systems.
  • 35. Efficient performance using Algebird sortWithTake uses: class PriorityQueueMonoid[T](max : Int)(implicit ord : Ordering[T]) extends Monoid[PriorityQueue[T]]
    The PriorityQueue case:
    • The identity element is the empty PriorityQueue.
    • Two PriorityQueues can be added. With K = 5:
        Q1 values         = 55, 34, 21, 13, 8
        Q2 values         = 100, 80, 60, 40, 20
        Q1 plus Q2 values = 100, 80, 60, 55, 40
    • The addition is associative and commutative.
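To make the monoid behaviour concrete, here is a simplified stand-in written with plain Scala lists. It is not Algebird's PriorityQueueMonoid; the names `TopKMonoid`, `TopK`, `zero` and `plus` are assumptions made for this sketch, and a sorted list replaces the real priority queue.

```scala
object TopKMonoid {
  // A simplified bounded queue: at most k values, largest first.
  final case class TopK(k: Int, values: List[Int])

  // Identity element: the empty queue.
  def zero(k: Int): TopK = TopK(k, Nil)

  // Adding two queues keeps only the k largest values of their union.
  def plus(a: TopK, b: TopK): TopK = {
    require(a.k == b.k, "both queues must share the same bound")
    TopK(a.k, (a.values ++ b.values).sorted(Ordering[Int].reverse).take(a.k))
  }
}
```

Adding the slide's Q1 and Q2 with this `plus` gives 100, 80, 60, 55, 40. Because the result does not depend on the order or grouping of the additions, partial queues can be merged wherever it is convenient, including on the map side.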
  • 36. Efficient performance using Algebird All Monoid aggregations can start in the Map phase, then finish in the Reduce phase. This decreases the amount of traffic from the mappers to the reducers. Performed implicitly when using Scalding built-in aggregation functions: average, sum, sizeAveStdev, histogram, approximateUniqueCount, sortWithTake
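A small plain-Scala sketch may help show why starting the aggregation in the Map phase reduces traffic. It uses an average, carried as a (sum, count) pair; the names `PartialAverage`, `mapSide` and `reduceSide` are hypothetical, the input is assumed to be non-empty, and nothing here is Scalding's own implementation.

```scala
object PartialAverage {
  // Partial state for an average: (sum, count). Pairs add component-wise,
  // so partial results can be merged in any grouping.
  type SumCount = (Double, Long)

  def plus(a: SumCount, b: SumCount): SumCount = (a._1 + b._1, a._2 + b._2)

  // Map side: collapse all values seen by one mapper into a single pair.
  def mapSide(values: Seq[Double]): SumCount =
    values.foldLeft((0.0, 0L))((acc, v) => plus(acc, (v, 1L)))

  // Reduce side: merge the partial pairs and finish the computation.
  def reduceSide(partials: Seq[SumCount]): Double = {
    val (sum, count) = partials.foldLeft((0.0, 0L))(plus)
    sum / count
  }
}
```

With two mappers holding `Seq(1.0, 2.0, 3.0)` and `Seq(4.0, 5.0)`, only two pairs reach the reducer instead of five values, and the reducer still arrives at the overall average.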
  • 37. Improving performance Our second performance issue: what about the cost of an inefficient order of the map and reduce steps?
  • 38. Top elements problem revisited New problem definition: Output the percentage p of top elements instead of the fixed K top elements. What is K? K = p * count
  • 39. Top %p of elements algorithm flow: Read input records → … → Divide to groups by site section → for each group: Count the number of records (What is K? K = p * count) → Sort records, take top p → Output top records
  • 40. Top %p of elements Scalding job
      class TopPJob(args : Args) extends Job(args) {
        // visitScheme after join with exclusion list
        val visits : RichPipe = …
        val counts = visits
          .groupBy('section){_.size('sectionSize)}
          .map('sectionSize -> 'sectionK){size : Int => {size * p}.toInt}
        // taking top %p of elements
        visits.joinWithTiny('section -> 'section, counts) …
      }
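The count-then-take logic can be checked on a small in-memory example. This is a plain-Scala sketch of the algorithm, not of the Scalding job: records are reduced to (section, saleValue) pairs, and the names `TopPercent` and `topP` are made up for illustration.

```scala
object TopPercent {
  // (section, saleValue): a reduced form of the slides' visit record
  type Sale = (String, Double)

  // For each section: count the records, derive K = (p * count).toInt,
  // then keep the K records with the highest sale value.
  def topP(sales: Seq[Sale], p: Double): Map[String, Seq[Double]] =
    sales.groupBy(_._1).map { case (section, rows) =>
      val k = (rows.size * p).toInt
      section -> rows.map(_._2).sortWith(_ > _).take(k)
    }
}
```

Note that the count of each section has to be known before the top records of that section can be taken, which is why the job needs more than one pass over the visits.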
  • 42. Flow graph How will this flow be executed on Hadoop? • How many MapReduce steps will be performed? • What will be the input to each step? • What logic will each contain? Run with --tool.graph!
  • 43. Flow graph (figure: full flow in Cascading terminology)
  • 44. Flow graph (figure: full flow in Cascading terminology, annotated: Reading input, join with exclusion list; Split to counting; Counting and calculating K; Join with counting result; Joining with K and sorting)
  • 46. Flow graph And another graph (figure): First step: sources (Records input, Exclusion list) → group. Second step: sources (Records input, Exclusion list) → group → sink (Output file)
  • 47. Flow graph
    Changing the job so that the join with the exclusion list is performed only once:
      val visits : RichPipe = …
        .project(visitScheme)
        .forceToDisk   // Only a single line is added!
      val counts = visits
        .groupBy('section){_.size('sectionSize)} …
      visits.joinWithTiny('section -> 'section, counts) …
  • 48. Flow graph The new MapReduce steps (figure): First step: sources (Records input, Exclusion list). Second step: group, sink. Third step: group, Output file
  • 49. Improving performance We saw how: • Writing Scalding jobs is simple, intuitive and fast. • We can use external resources to improve the performance of our algorithms. Scalding does some of this work implicitly for us. • We can use Cascading, the library Scalding is built on, to understand exactly which steps will run.
  • 50. Additional features
    Some other features in Scalding:
    • Typed API
        TypedTsv[visitType](args("input"))
          .filter(_._2 == "Israel")                 // TypedPipe[visitType]
          .toPipe(visitScheme)
          .toTypedPipe[visitType](visitScheme)      // TypedPipe[visitType]
    • Testing using JobTest: give the input and get the output as Lists
    • Matrix API: useful for running graph algorithms such as PageRank
  • 51. Scalding at LivePerson How do we use Scalding at LivePerson? • The main tool in the Data Science team • Used both for quick data exploration and in production jobs