3. THE AGENDA
1. Quick survey of the current landscape of Hadoop tools
2. A light comparison of the best functional tools
3. General advice
4. Some code samples
5. VANILLA MAPREDUCE
package org.myorg;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

  public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "wordcount");
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.waitForCompletion(true);
  }
}
6. PIG
•Apache Pig is a really great tool for quick, ad hoc data analysis
•While we can do amazing things with it, I'm not sure we should
•Anything complicated requires User Defined Functions (UDFs)
•UDFs require a separate code base
•Now you have to maintain two separate languages for no good reason
9. DO
•Use a higher-level abstraction like distributed lists
•Use objects instead of tuples (see the sketch after this list)
•Use a good serialization format
•Always check for data quality
•Use flatMap for uncertain computations
•Develop reusable reductions (monoids!)
•Prefer map-side operations when possible
•Always check for data skew
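As a minimal sketch of "use objects instead of tuples" in plain Scala (the PageView type and its fields are hypothetical names for illustration):

// A case class names each field, so downstream code reads view.url
// instead of the opaque tuple._2.
case class PageView(userId: String, url: String, durationMs: Long)

val views: List[PageView] = List(
  PageView("alice", "/home", 1200L),
  PageView("bob",   "/home",  300L)
)

// Grouping by a named field is self-documenting, unlike _._2 on a tuple.
val byUrl: Map[String, List[PageView]] = views.groupBy(_.url)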
10. DON’T
•Never use nulls
•Don’t use too many levels of nesting
•Don’t use shared state
•Don’t use iteration (too much)
•Try not to start with a complicated approach
12. SOME SCALA CODE
val myLines = getStuff
val myWords = myLines.flatMap(w => w.split("\\s+"))
val myWordsGrouped = myWords.groupBy(identity)
val countedWords = myWordsGrouped.mapValues(x => x.size)
write(countedWords)
13. SOME SCALDING CODE
val myLines = TextLine(path)
val myWords = myLines.flatMap(w => w.split(" "))
  .groupBy(identity)
  .size
myWords.write(TypedTsv(output))
14. WHAT HAPPENED ON THE PREVIOUS SLIDE?
•flatMap()
–Similar to map, but a one-to-many rather than a one-to-one mapping
–Use when a given input may or may not produce a result
–Can handle errors with the Option (Maybe) monad: a None value is simply discarded (sketch below)
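A minimal plain-Scala sketch of the Option pattern above (parseCount is a hypothetical helper; the same idea carries over to Scalding pipes):

// Parsing can fail, so return an Option: Some(n) on success, None on failure.
def parseCount(s: String): Option[Int] =
  try { Some(s.trim.toInt) } catch { case _: NumberFormatException => None }

val raw = List("12", "oops", "7")

// flatMap flattens the Options: the None for "oops" is simply discarded.
val counts: List[Int] = raw.flatMap(parseCount)  // List(12, 7)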
15. MORE EXPLANATION
•groupBy()
–Takes a function that generates a key from the given value
–Logically the result can be thought of as an associative array: key -> List of values (sketch below)
–In Scalding this doesn't necessarily force a Hadoop reduce phase; it depends on what comes after
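A plain-Scala sketch of what groupBy(identity) produces logically; the Scalding pipe builds the same key -> values association, just distributed:

val words = List("to", "be", "or", "not", "to", "be")

// groupBy(identity) keys each word by itself: key -> List of values.
val grouped: Map[String, List[String]] = words.groupBy(identity)
// Map("to" -> List("to", "to"), "be" -> List("be", "be"), ...)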
16. THE BEST PART
•size
–This part is pure magic
–size is actually sugar for .map(t => 1L).sum (sketch below)
–sum has an implicit argument, mon: Monoid[T]
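A plain-Scala sketch of that desugaring (in Scalding the sum resolves an implicit Algebird Monoid[Long]; the standard library's numeric sum stands in for it here):

val grouped = List("to", "be", "to").groupBy(identity)

// The two pipelines are equivalent: size is sugar for map-to-1 plus sum.
val viaSize = grouped.mapValues(_.size)
val viaSum  = grouped.mapValues(_.map(_ => 1L).sum)  // what .size desugars to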
17. MONOIDS: WHY YOU SHOULD CARE ABOUT MATH
•From Wikipedia:
–a monoid is an algebraic structure with a single associative binary operation and an identity element (minimal sketch below)
•Almost everything you want to do is a monoid
–Standard addition of numeric types is the most common
–List/map/set/string concatenation
–Top k elements
–Bloom filters, count-min sketches, HyperLogLog
–Stochastic gradient descent
–Histograms
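A minimal sketch of that definition in Scala, using a hypothetical SimpleMonoid trait (Algebird's real trait, com.twitter.algebird.Monoid, has the same shape):

// A monoid is just an associative plus and an identity element (zero).
trait SimpleMonoid[T] {
  def zero: T
  def plus(l: T, r: T): T
}

// Addition on Long: zero is 0, plus is +, and + is associative.
val longSum = new SimpleMonoid[Long] {
  def zero = 0L
  def plus(l: Long, r: Long) = l + r
}

Associativity is the property that matters on Hadoop: it lets partial sums be combined on the map side, in any grouping, before the reduce.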
18. MORE MONOID STUFF
•If you are aggregating, you are probably using a monoid
•Scalding has Algebird and monoid support baked in
•Scoobi can use Algebird (or any other monoid library) with almost no work:
–combine { case (l, r) => monoid.plus(l, r) }
•Algebird handles tuples with ease
•Very easy to define monoids for your own types (sketch below)
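A sketch of defining a monoid for your own type, assuming Algebird's Monoid trait and its Monoid.sum helper (the Stats type is hypothetical):

import com.twitter.algebird.Monoid

// A record count plus a running total, combined componentwise.
case class Stats(count: Long, total: Double)

implicit val statsMonoid: Monoid[Stats] = new Monoid[Stats] {
  def zero = Stats(0L, 0.0)
  def plus(l: Stats, r: Stats) = Stats(l.count + r.count, l.total + r.total)
}

// With the implicit in scope, Monoid.sum folds any collection of Stats.
val merged = Monoid.sum(List(Stats(1, 2.0), Stats(1, 3.5)))  // Stats(2, 5.5)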
19. ADVANTAGES
•Type checking
–Find errors at compile time, not at job submission time (or even worse, 5 hours after job submission time)
•Single language
–Scala is a full programming language
•Productivity
–Since the code you write looks like collections code, you can use the Scala REPL to prototype
•Clarity
–Write code as a series of operations and let the job planner smash it all together
21. THINGS TO TAKE AWAY
•MapReduce is a functional problem, so we should use functional tools
•You can increase productivity, safety, and maintainability all at once with no downside
•Thinking of data flows in a functional way opens up many new possibilities
•The community is awesome