2. Me
• Haskell DSL for development of computer vision algorithms targeting GPUs
• Machine learning, software systems, computer vision, optimisation, networks, control and signal processing
• Predictive analytics for the enterprise
3. Hadoop app development – wish list
• Quick dev cycles
• Expressive
• Reusability
• Type safety
• Reliability
4. Bridging the “tooling” gap
[Diagram: Scoobi bridges the gap between application code and Hadoop's low-level Java APIs. For implementation, the DList and DObject abstractions compile down to MapReduce pipelines on Hadoop MapReduce; for testing, Scoobi integrates with ScalaCheck.]
5. At a glance
• Scoobi = Scala for Hadoop
• Inspired by Google’s FlumeJava
• Developed at NICTA
• Open-sourced Oct 2011
• Apache V2
7. Java style
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

  public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "wordcount");
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.waitForCompletion(true);
  }
}

Source: http://wiki.apache.org/hadoop/WordCount
8. DList abstraction
Distributed List (DList): an abstraction for data on HDFS plus the transformations applied to it.

DList type                          Abstraction for
DList[String]                       Lines of text files
DList[(Int, String, Boolean)]       CSV files of the form “37,Joe,M”
DList[(Float, Map[String, Int])]    Avro files with schema {record {float, map}}
9. Scoobi style
importcom.nicta.scoobi.Scoobi._
// Count the frequency of words from corpus of documents
objectWordCountextendsScoobiApp {
def run() {
vallines: DList[String] = fromTextFile(args(0))
valfreqs: DList[(String, Int)] =
lines.flatMap(_.split(" ")) // DList[String]
.map(w=> (w, 1)) // DList[(String, Int)]
.groupByKey// DList[(String, Iterable[Int])]
.combine(_+_) // DList[(String, Int)]
persist(toTextFile(freqs, args(1)))
}
}
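The pipeline above has the same shape as ordinary Scala collection code. As a mental model (illustration only, not Scoobi code), here is the equivalent word count on a local List, where `groupBy` plus a per-group sum plays the role of `groupByKey` and `combine`:

```scala
// In-memory analogue of the Scoobi word-count pipeline (illustration only):
// the same flatMap / map / group / reduce shape on a plain List.
object LocalWordCount {
  def wordCount(lines: List[String]): Map[String, Int] =
    lines.flatMap(_.split(" "))                          // List[String]
         .map(w => (w, 1))                               // List[(String, Int)]
         .groupBy(_._1)                                  // Map[String, List[(String, Int)]]
         .map { case (w, ps) => (w, ps.map(_._2).sum) }  // Map[String, Int]
}
```

The difference is that a DList never holds its elements in memory: the same description is compiled into MapReduce jobs over HDFS data.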
10. DList trait
traitDList[A] {
/* Abstract methods */
def parallelDo[B](dofn: DoFn[A, B]): DList[B]
def ++(that: DList[A]): DList[A]
def groupByKey[K, V]
(implicit A <:< (K, V)): DList[(K, Iterable[V])]
def combine[K, V]
(f: (V, V) => V)
(implicit A <:< (K, Iterable[V])): DList[(K, V)]
/* All other methods are derived, e.g. „map‟ */
}
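To see how the derived methods fall out of `parallelDo`, here is a sketch using a minimal in-memory stand-in (`MiniDList`, `MiniDoFn` are hypothetical names; Scoobi's real DoFn also has setup and cleanup phases — only the shape is shown):

```scala
// Sketch: deriving `map` from `parallelDo` on a toy in-memory DList stand-in.
trait MiniDoFn[A, B] { def process(a: A, emit: B => Unit): Unit }

case class MiniDList[A](elems: Seq[A]) {
  def parallelDo[B](dofn: MiniDoFn[A, B]): MiniDList[B] = {
    val out = scala.collection.mutable.Buffer[B]()
    elems.foreach(a => dofn.process(a, out += _))
    MiniDList(out.toSeq)
  }

  // Derived: map is a parallelDo that emits exactly one output per input
  def map[B](f: A => B): MiniDList[B] =
    parallelDo(new MiniDoFn[A, B] {
      def process(a: A, emit: B => Unit) = emit(f(a))
    })
}
```

`flatMap`, `filter` and friends derive the same way, by varying how many outputs `process` emits per input.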
11. Under the hood
fromTextFile LD
lines HDFS
flatMap PD
words
map PD MapReduce Job
word1
groupByKey GBK
wordG HDFS
combine CV
freq
persist
12. Removing less than the average
importcom.nicta.scoobi.Scoobi._
// Remove all integers that are less than the average integer
objectBetterThanAverageextendsScoobiApp {
def run() {
valints: DList[Int] =
fromTextFile(args(0)) collect { case AnInt(i) =>i }
valtotal: DObject[Int] = ints.sum
valcount: DObject[Int] = ints.size
valaverage: DObject[Int] =
(total, count) map { case (t, c) =>t / c }
valbigger: DList[Int] =
(average join ints) filter { case (a, i) =>i> a }
persist(toTextFile(bigger, args(1)))
}
}
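On local collections the same logic is a few lines (illustration only; the point of the DObject version is that `total`, `count` and the filter run as distributed jobs, with `average` computed client-side in between):

```scala
// Plain-Scala mirror of the BetterThanAverage pipeline (illustration only):
// compute total and count, derive the average, then filter against it.
object LocalBetterThanAverage {
  def betterThanAverage(ints: List[Int]): List[Int] = {
    val total   = ints.sum
    val count   = ints.size
    val average = total / count   // integer division, as on the slide
    ints.filter(_ > average)
  }
}
```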
13. Under the hood
[Diagram: execution plan. A first MapReduce job loads ints from HDFS and computes total and count via two parallelDo (PD) / group-by-key (GBK) / combine (CV) / materialise (M) branches. A client-side computation (OP) derives average from total and count. average is shipped via the distributed cache (DCache) to a second MapReduce job whose parallelDo steps filter ints, writing bigger back to HDFS.]
15. Mirroring the Scala Collection API

DList => DList    DList => DObject
flatMap           reduce
map               product
filter            sum
filterNot         length
groupBy           size
partition         count
flatten           max
distinct          maxBy
++                min
keys, values      minBy
16. Building abstractions
Functional programming:
• Functions as procedures
• Functions as parameters
= Composability + Reusability
17. Composing
// Compute the average of a DList of “numbers”
def average[A : Numeric](in: DList[A]): DObject[A] =
  (in.sum, in.size) map { case (sum, size) => sum / size }

// Compute histogram
def histogram[A](in: DList[A]): DList[(A, Int)] =
  in.map(x => (x, 1)).groupByKey.combine(_+_)

// Throw away words with less-than-average frequency
def betterThanAvgWords(lines: DList[String]): DList[String] = {
  val words    = lines.flatMap(_.split(" "))
  val wordCnts = histogram(words)
  val avgFreq  = average(wordCnts.values)

  (avgFreq join wordCnts) collect { case (avg, (w, f)) if f > avg => w }
}
18. Unit-testing ‘histogram’ with ScalaCheck
// Specification for histogram function
class HistogramSpec extends HadoopSpecification {

  "Histogram from DList" >> {

    "Sum of bins must equal size of DList" >> { implicit c: SC =>
      Prop.forAll { list: List[Int] =>
        val hist   = histogram(list.toDList)
        val binSum = persist(hist.values.sum)
        binSum == list.size
      }
    }

    "Number of bins must equal number of unique values" >> { implicit c: SC =>
      Prop.forAll { list: List[Int] =>
        val input   = list.toDList
        val bins    = histogram(input).keys.size
        val uniques = input.distinct.size
        val (b, u)  = persist(bins, uniques)
        b == u
      }
    }
  }
}
19. sbt integration
> test-only *Histogram* -- exclude cluster
[info] HistogramSpec
[info]
[info] Histogram from DList
[info]   + Sum of bins must equal size of DList
[info]     No cluster execution time
[info]   + Number of bins must equal number of unique values
[info]     No cluster execution time
[info]
[info] Total for specification HistogramSpec
[info] Finished in 12 seconds, 600 ms
[info] 2 examples, 4 expectations, 0 failure, 0 error
[info] Passed: : Total 2, Failed 0, Errors 0, Passed 2, Skipped 0

Other invocations:
> test-only *Histogram*
> test-only *Histogram* -- scoobi verbose
> test-only *Histogram* -- scoobi verbose.warning

Dependent JARs are copied (once) to a directory on the cluster (~/libjars by default).
20. Other features
• Grouping:
– API for controlling Hadoop’s sort-and-shuffle
– Useful for implementing secondary sorting
• Join and Co-group helper methods
• Matrix multiplication utilities
• I/O:
– Text, sequence, Avro
– Roll your own
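The secondary-sorting idea that the Grouping API enables at Hadoop's sort-and-shuffle level amounts to the following on local data (a sketch only; `groupSorted` is a hypothetical helper, not the Scoobi API — in Scoobi the value ordering happens during the shuffle rather than in memory):

```scala
// Sketch of secondary sorting on local data: group records by key,
// with each group's values arriving in sorted order.
object SecondarySortSketch {
  def groupSorted[K: Ordering, V: Ordering](pairs: Seq[(K, V)]): Seq[(K, Seq[V])] =
    pairs.groupBy(_._1)
         .toSeq
         .sortBy(_._1)                                      // primary sort: by key
         .map { case (k, vs) => (k, vs.map(_._2).sorted) }  // secondary: values
}
```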
21. Want to know more?
• http://nicta.github.com/scoobi
• Mailing lists:
– http://groups.google.com/group/scoobi-users
– http://groups.google.com/group/scoobi-dev
• Twitter:
– @bmlever
– @etorreborre
• Meet me:
– Will also be at Hadoop Summit (June 13-14)
– Keen to get feedback