Some tips for effective
map reducing
CHRISTOPHER SEVERS
eBay
eBay Netanya
December 2nd, 2013
THE AGENDA
1. Quick survey of the current landscape for Hadoop tools
2. A light comparison of the best functional tools
3. General advice
4. Some code samples

PRESENTATION TITLE GOES HERE

THE ALTERNATIVES
I promise this part will be quick
VANILLA MAPREDUCE
package org.myorg;

import java.io.IOException;
import java.util.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

    // Mapper: emits (word, 1) for every token in the input line.
    public static class Map extends
            Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer: sums the counts emitted for each word.
    public static class Reduce extends
            Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    // Driver: args[0] is the input path, args[1] is the output path.
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Job.getInstance replaces the deprecated new Job(conf, name) constructor.
        Job job = Job.getInstance(conf, "wordcount");
        job.setJarByClass(WordCount.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

PIG
•Apache Pig is a great tool for quick, ad hoc data analysis
•While we can do amazing things with it, I’m not sure we should
•Anything complicated requires User Defined Functions (UDFs)
•UDFs require a separate code base
•Now you have to maintain code in two separate languages for no good reason

APACHE HIVE
•On previous slide: s/Pig/Hive/g

GENERAL ADVICE
Do this, not that
DO
•Use a higher-level abstraction like distributed lists
•Use objects instead of tuples
•Use a good serialization format
•Always check for data quality
•Use flatMap for uncertain computations
•Develop reusable reductions (monoids!)
•Prefer map-side operations when possible
•Always check for data skew

DON’T
•Never use nulls
•Don’t use too many levels of nesting
•Don’t use shared state
•Don’t use iteration (too much)
•Try not to start with a complicated approach

SCALDING AND SCOOBI
This is what we use at eBay
SOME SCALA CODE
val myLines = getStuff
val myWords = myLines.flatMap(w => w.split("\\s+"))
val myWordsGrouped = myWords.groupBy(identity)
val countedWords = myWordsGrouped.mapValues(x => x.size)
write(countedWords)

SOME SCALDING CODE
val myLines = TypedPipe.from(TextLine(path))
val myWords = myLines.flatMap(w => w.split("\\s+"))
  .groupBy(identity)
  .size
myWords.write(TypedTsv[(String, Long)](output))

WHAT HAPPENED ON THE PREVIOUS SLIDE?
•flatMap()
–Similar to map, but a one-to-many (zero or more outputs per input) rather than one-to-one mapping
–Use it when a computation may or may not produce a result
–Can handle errors with the Option (Maybe) monad: a None value is discarded
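
A minimal sketch of both uses of flatMap on plain Scala collections, not Scalding pipes. The parseCount helper and the sample inputs are hypothetical, made up for illustration only.

```scala
import scala.util.Try

object FlatMapSketch {
  // Hypothetical parser: returns None for anything that is not an integer.
  def parseCount(field: String): Option[Int] = Try(field.trim.toInt).toOption

  // One-to-many: each line yields zero or more words.
  val words: List[String] =
    List("a b", "", "c").flatMap(line => line.split("\\s+").filter(_.nonEmpty))

  // Uncertain computation: None results are dropped, Some(x) contributes x.
  val parsed: List[Int] =
    List("3", "oops", "", "12").flatMap(s => parseCount(s))

  def main(args: Array[String]): Unit = {
    println(words)
    println(parsed)
  }
}
```

Bad records simply disappear from the output here. If you need to know how many were dropped, count them separately instead of relying on the silent discard.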

MORE EXPLANATION
•groupBy()
–Takes a function that generates a key from the given value
–Logically, the result can be thought of as an associative array: key -> list of values
–In Scalding this doesn’t necessarily force a Hadoop reduce phase; it depends on what comes after
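
The "key -> list of values" picture can be sketched with ordinary in-memory Scala collections. The word list is made up, and this says nothing about how Scalding plans the job on a cluster.

```scala
object GroupBySketch {
  val words: List[String] = List("pig", "hive", "pig", "scalding", "hive", "pig")

  // identity as the key function: every word is its own key.
  val grouped: Map[String, List[String]] = words.groupBy(identity)

  // Any function of the value can serve as the key, e.g. the word length.
  val byLength: Map[Int, List[String]] = words.groupBy(_.length)

  // What comes after the grouping decides the real work: here, a count per key.
  val counts: Map[String, Int] = grouped.map { case (word, ws) => word -> ws.size }

  def main(args: Array[String]): Unit = println(counts)
}
```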

THE BEST PART
•size
–This part is pure magic
–size is actually sugar for .map( t => 1L).sum
–sum has an implicit argument, mon: Monoid[T]
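
To take some of the magic out of it, here is a rough sketch of the desugaring on a plain Iterable. The Monoid trait, sum and size below are hypothetical stand-ins written for this example; they are not Algebird's or Scalding's actual definitions.

```scala
object SizeSketch {
  // Hypothetical minimal type class; the real Algebird Monoid is richer.
  trait Monoid[T] {
    def zero: T
    def plus(l: T, r: T): T
  }

  implicit val longMonoid: Monoid[Long] = new Monoid[Long] {
    val zero = 0L
    def plus(l: Long, r: Long): Long = l + r
  }

  // sum finds the monoid implicitly, mirroring "mon: Monoid[T]" on the slide.
  def sum[T](values: Iterable[T])(implicit mon: Monoid[T]): T =
    values.foldLeft(mon.zero)(mon.plus)

  // size: replace every element with 1L, then sum the ones.
  def size[A](values: Iterable[A]): Long = sum(values.map(_ => 1L))

  def main(args: Array[String]): Unit = println(size(List("a", "b", "c")))
}
```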

MONOIDS: WHY YOU SHOULD CARE ABOUT MATH
•From Wikipedia:
–A monoid is an algebraic structure with a single associative binary operation and an identity element.

•Almost everything you want to do is a monoid
–Standard addition of numeric types is the most common
–List/map/set/string concatenation
–Top k elements
–Bloom filter, count-min sketch, HyperLogLog
–Stochastic gradient descent
–Histograms
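
As one concrete case, a histogram can be sketched as a Map from bucket to count, with key-wise addition as the operation and the empty map as the identity. The names and sample data are made up, and the two law checks only cover these three values; they are a sanity check, not a proof.

```scala
object HistogramMonoidSketch {
  type Histogram = Map[String, Long]

  // Identity element: the empty histogram.
  val zero: Histogram = Map.empty

  // Associative operation: add the counts key by key.
  def plus(l: Histogram, r: Histogram): Histogram =
    r.foldLeft(l) { case (acc, (k, v)) => acc.updated(k, acc.getOrElse(k, 0L) + v) }

  val a: Histogram = Map("pig" -> 2L)
  val b: Histogram = Map("pig" -> 1L, "hive" -> 4L)
  val c: Histogram = Map("hive" -> 1L)

  // Spot checks of the two monoid laws on the sample values.
  val associative: Boolean = plus(plus(a, b), c) == plus(a, plus(b, c))
  val hasIdentity: Boolean = plus(zero, a) == a && plus(a, zero) == a

  // Partial histograms can be merged in any grouping and give the same total.
  val merged: Histogram = List(a, b, c).foldLeft(zero)(plus)

  def main(args: Array[String]): Unit = println(merged)
}
```

Associativity is the property that matters for MapReduce: partial results can be combined on the map side, then again on the reduce side, and the total does not depend on how they were grouped.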

MORE MONOID STUFF
•If you are aggregating, you are probably using a monoid
•Scalding has Algebird and monoid support baked in
•Scoobi can use Algebird (or any other monoid library) with almost no work
–combine { case (l,r) => monoid.plus(l,r) }

•Algebird handles tuples with ease
•Very easy to define monoids for your own types
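
A sketch of the last two bullets under stated assumptions: the Monoid trait and the tuple instance below are hand-rolled stand-ins for what a library such as Algebird provides, and MaxLen is a made-up type. The final reduce mirrors the shape of the combine call above, but on a plain List rather than a Scoobi DList.

```scala
object CustomMonoidSketch {
  // Hypothetical stand-in for a library Monoid type class.
  trait Monoid[T] { def zero: T; def plus(l: T, r: T): T }

  // Own type: longest word length seen so far.
  // MaxLen(0) is only a valid identity because lengths are never negative.
  final case class MaxLen(value: Int)

  implicit val maxLenMonoid: Monoid[MaxLen] = new Monoid[MaxLen] {
    val zero = MaxLen(0)
    def plus(l: MaxLen, r: MaxLen): MaxLen = MaxLen(math.max(l.value, r.value))
  }

  implicit val longMonoid: Monoid[Long] = new Monoid[Long] {
    val zero = 0L
    def plus(l: Long, r: Long): Long = l + r
  }

  // A pair of monoids is a monoid: combine component by component.
  implicit def tuple2Monoid[A, B](implicit ma: Monoid[A], mb: Monoid[B]): Monoid[(A, B)] =
    new Monoid[(A, B)] {
      def zero: (A, B) = (ma.zero, mb.zero)
      def plus(l: (A, B), r: (A, B)): (A, B) =
        (ma.plus(l._1, r._1), mb.plus(l._2, r._2))
    }

  val words: List[String] = List("pig", "scalding", "hive")
  val monoid: Monoid[(Long, MaxLen)] = implicitly[Monoid[(Long, MaxLen)]]

  // One pass computes both the word count and the longest word length.
  val (count, longest) =
    words.map(w => (1L, MaxLen(w.length))).reduce((l, r) => monoid.plus(l, r))

  def main(args: Array[String]): Unit = println((count, longest))
}
```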

ADVANTAGES
•Type checking
–Find errors at compile time, not at job submission time (or even worse, 5 hours after job submission time)

•Single language
–Scala is a full programming language

•Productivity
–Since the code you write looks like collections code, you can use the Scala REPL to prototype

•Clarity
–Write code as a series of operations and let the job planner smash it all together
CONCLUSION
We’re almost done!
THINGS TO TAKE AWAY
•MapReduce is a functional problem, so we should use functional tools
•You can increase productivity, safety, and maintainability all at once with no downside
•Thinking of data flows in a functional way opens up many new possibilities
•The community is awesome
THANKS!
•Questions/comments?


Presentation on functional data mining at the IGT Cloud meet up at eBay Netanya
