This is a quick introduction to Scalding and monoids. Scalding is a Scala library that makes writing MapReduce jobs very easy. Monoids, for their part, bring parallelism and code quality, and they make some more challenging algorithms look very easy.
The talk was held at the Helsinki Data Science meetup on January 9th 2014.
2. MapReduce
• Programming model for processing large data sets with a parallel, distributed algorithm on a cluster.
• Inspired by the map and reduce functions commonly found in functional programming languages
• map() performs translations and filtering on given values
• reduce() performs a summary operation on given values
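The two phases can be illustrated with plain Scala collections — a toy sketch (no Hadoop involved, names are mine) using the classic word-count task:

```scala
// Toy illustration of the MapReduce phases on an in-memory collection.
object MapReduceSketch {
  def wordCount(lines: Seq[String]): Map[String, Int] = {
    // "map" phase: translate each line into (word, 1) pairs
    val mapped: Seq[(String, Int)] =
      lines.flatMap(_.split("\\s+")).map(word => (word, 1))
    // shuffle: group the pairs by key (the framework does this between phases)
    val grouped: Map[String, Seq[(String, Int)]] = mapped.groupBy(_._1)
    // "reduce" phase: summarize each group into a single count
    grouped.map { case (word, pairs) => (word, pairs.map(_._2).sum) }
  }

  def main(args: Array[String]): Unit =
    // counts: to -> 2, be -> 2, or -> 1, not -> 1
    println(wordCount(Seq("to be or", "not to be")))
}
```

On a cluster the same structure applies, except the map and reduce phases run on many machines and the shuffle moves data between them.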
3. How does it work?
[Diagram found on the Internet; source unfortunately forgotten]
4. The scene
• Hadoop – open source implementation of Google’s MapReduce and Google File System papers
• Java…
• Higher level frameworks/platforms
  – Hive ≈ SQL
  – Pig (procedural ≈ “more programming than SQL”)
  – Cascading – Java MR application framework for enterprise data flows
    • If you must do Java, do this!
  – Scalding – Scala DSL for Cascading, easy to pick up yet very powerful
  – Cascalog – Clojure DSL for Cascading, declarative, logic programming
5. The scene (*)
* Borrowed from an excellent presentation by Vitaly Gordon and Christopher Severs
6. “Hadoop is a distributed system for counting words”

package org.myorg;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

public class WordCount {

  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        output.collect(word, one);
      }
    }
  }

  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    conf.setMapperClass(Map.class);
    conf.setCombinerClass(Reduce.class);
    conf.setReducerClass(Reduce.class);
    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}
7. What do we actually want to do?
Documents (lines) → Tokenize → GroupBy (token) → Count → Word count
8. Word Count in Scalding

package com.sanoma.cda.examples
import com.twitter.scalding._

class WordCount1(args: Args) extends Job(args) {
  TextLine(args("input"))
    .flatMap('line -> 'word) { line: String => line.split("\\s+") }
    .groupBy('word) { _.size }
    .write(Tsv(args("output")))
}

There is scald.rb to get you started (get it from the Github project).
Building and running a fat jar (for local, include hadoop; for cluster, mark it “provided”):

> sbt assembly
> java -jar target/scala-2.10/scalding_talk-assembly-0.1.jar
  com.sanoma.cda.examples.WordCount1 --local
  --input data/11.txt.utf-8 --output wc.txt
> hadoop jar job-jars/scalding_talk-assembly-0.1.jar
  -Dmapred.reduce.tasks=70 com.sanoma.cda.examples.WordCount1 --hdfs
  --input /data/AliceInWonderland --output /user/Alice_wc

Output:
the      1664
and      1172
to       780
a        773
of       662
she      596
said     484
in       416
it       401
was      356
you      329
I        301
as       260
that     246
Alice    226
…        221
Alice,   76
Alice.   54
Alice;   16
Alice's  11
Alice:   7
(Alice   4
Alice!   3
Alice,)  2
9. Word Count in Scalding

package com.sanoma.cda.examples
import com.twitter.scalding._

class WordCount2(args: Args) extends Job(args) {
  TextLine(args("input"))
    .flatMap('line -> 'word) { line: String => tokenize(line) }
    .filter('word) { word: String => word != "" }
    .groupBy('word) { _.size }
    .groupAll { _.sortBy(('size, 'word)).reverse } // this is just for easy results
    .write(Tsv(args("output")))

  def tokenize(text: String): Array[String] = {
    text.toLowerCase.replaceAll("[^a-z0-9\\s]", "").split("\\s+")
  }
}

Output:
the    1804
and    912
to     801
a      684
of     625
it     541
she    538
said   462
you    429
in     428
i      400
alice  385
was    358
that   291
as     272
her    248
with   228
at     224
on     204
all    197
10. Word count in Scalding
Almost 1-to-1 relation between the process and the Scalding code!
UDFs go directly in Scala, and Java libraries can be used.

Documents (lines) → Tokenize → GroupBy (token) → Count → Word count

package com.sanoma.cda.examples
import com.twitter.scalding._

class WordCount2(args: Args) extends Job(args) {
  TextLine(args("input"))
    .flatMap('line -> 'word) { tokenize }
    .groupBy('word) { _.size }
    .write(Tsv(args("output")))

  def tokenize(text: String): Array[String] = {
    text.toLowerCase.replaceAll("[^a-z0-9\\s]", "").split("\\s+")
  }
}
11. About Scalding
• Started at Twitter – years of production use
• Well tested and optimized by different teams, including Twitter, Concurrent Inc., Etsy, …
• Has a very fast local mode (no need to install Hadoop locally)
• Flow planner is designed to be portable → in the future, the same jobs might run on a Storm cluster, for example
• Scala… very nice programming language – YMMV
  – Functional & object oriented, has a REPL
12. Scalding Functions
• 3 APIs:
  – Fields-based API – easy to start from here
  – Type-safe API
  – Matrix API
• Fields-based API
  – Map-like functions
    • map, flatMap, project, insert, filter, limit…
  – Grouping/reducing functions
    • groupBy, groupAll
    • .size, .sum, .average, .sizeAveStdev, .toList, .max, .sortBy, .reduce, .foldLeft, .pivot, …
  – Join operations
    • joinWithSmaller, joinWithLarger, joinWithTiny, crossWithTiny
    • InnerJoin, LeftJoin, RightJoin, OuterJoin
13. Scalding matrix API

package com.twitter.scalding.examples
import com.twitter.scalding._
import com.twitter.scalding.mathematics.Matrix

/**
 * Loads a directed graph adjacency matrix where a[i,j] = 1 if there is an edge from a[i] to b[j]
 * and computes the cosine of the angle between every two pairs of vectors
 */
class ComputeCosineJob(args: Args) extends Job(args) {
  import Matrix._

  val adjacencyMatrix = Tsv(args("input"), ('user1, 'user2, 'rel))
    .read
    .toMatrix[Long, Long, Double]('user1, 'user2, 'rel)

  // we compute the L2 normalized adjacency graph
  val matL2Norm = adjacencyMatrix.rowL2Normalize

  // we compute the inner product of the normalized matrix with itself,
  // which is equivalent to computing the cosine: AA^T / (||A|| * ||A||)
  val cosDist = matL2Norm * matL2Norm.transpose

  cosDist.write(Tsv(args("output")))
}
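What the job computes for each pair of row vectors is plain cosine similarity. A small stand-alone sketch of the underlying math (plain Scala, not the Matrix API; names are mine):

```scala
// Stand-alone sketch of the math behind the job: after L2-normalizing the
// rows, the inner product of two rows is exactly the cosine of the angle
// between them, since cos(u, v) = (u . v) / (||u|| * ||v||).
object CosineSketch {
  def dot(u: Seq[Double], v: Seq[Double]): Double =
    u.zip(v).map { case (a, b) => a * b }.sum

  def l2Normalize(u: Seq[Double]): Seq[Double] = {
    val norm = math.sqrt(dot(u, u))
    u.map(_ / norm)
  }

  def cosine(u: Seq[Double], v: Seq[Double]): Double =
    dot(l2Normalize(u), l2Normalize(v))

  def main(args: Array[String]): Unit = {
    println(cosine(Seq(1.0, 0.0), Seq(0.0, 1.0))) // orthogonal rows -> 0.0
    println(cosine(Seq(1.0, 1.0), Seq(2.0, 2.0))) // parallel rows -> 1.0 (up to rounding)
  }
}
```

rowL2Normalize does the normalization step for every row at once, so the single matrix product above covers all pairs.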
14.
15. What is a monoid?
• Closure: ∀a, b ∈ T : a • b ∈ T
• Associativity: ∀a, b, c ∈ T : (a • b) • c = a • (b • c)
• Identity element: ∃I ∈ T : ∀a ∈ T : I • a = a • I = a

Scala trait:
trait Monoid[T] {
  def zero: T
  def plus(left: T, right: T): T
}
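A concrete instance of the trait takes a few lines. The sketch below (object names are mine, not Algebird's) shows why associativity matters: partial results combined in any grouping give the same answer, which is what lets combiners and reducers work in parallel.

```scala
// The trait from the slide, plus a concrete Monoid for Int addition.
trait Monoid[T] {
  def zero: T
  def plus(left: T, right: T): T
}

object IntAddition extends Monoid[Int] {
  def zero: Int = 0
  def plus(left: Int, right: Int): Int = left + right
}

object MonoidDemo {
  // Any monoid gives a safe "sum" over a sequence: empty input yields zero.
  def sum[T](xs: Seq[T], m: Monoid[T]): T = xs.foldLeft(m.zero)(m.plus)

  def main(args: Array[String]): Unit = {
    val xs = Seq(1, 2, 3, 4)
    // Associativity: combining partial sums from two "machines" gives the
    // same result as a single sequential fold.
    val partial = IntAddition.plus(sum(xs.take(2), IntAddition), sum(xs.drop(2), IntAddition))
    println(sum(xs, IntAddition) == partial) // true
  }
}
```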
19. What’s the point?
• Easily unit testable operations
• Simple aggregation code → better quality
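"Easily unit testable" because the monoid laws are plain equalities over small values. A sketch of such a test (MapSum is a hypothetical stand-in for the Map monoid Algebird provides for jobs like the word count below):

```scala
// Sketch: checking monoid laws for a word-count-style Map monoid.
// MapSum is a hypothetical stand-in for Algebird's Map monoid.
object MapSum {
  val zero: Map[String, Int] = Map.empty
  def plus(left: Map[String, Int], right: Map[String, Int]): Map[String, Int] =
    right.foldLeft(left) { case (acc, (k, v)) =>
      acc + (k -> (acc.getOrElse(k, 0) + v))
    }
}

object MapSumLawCheck {
  def main(args: Array[String]): Unit = {
    val a = Map("alice" -> 1)
    val b = Map("alice" -> 2, "the" -> 1)
    val c = Map("the" -> 3)
    // Identity element
    assert(MapSum.plus(MapSum.zero, a) == a && MapSum.plus(a, MapSum.zero) == a)
    // Associativity
    assert(MapSum.plus(MapSum.plus(a, b), c) == MapSum.plus(a, MapSum.plus(b, c)))
    println("monoid laws hold on these examples")
  }
}
```

No Hadoop cluster, no test fixtures: the aggregation logic is an ordinary pure function.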
20. Word Count with Map Monoid

package com.sanoma.cda.examples
import com.twitter.scalding._
import com.twitter.algebird.Operators._

class WordCount3(args: Args) extends Job(args) {
  TextLine(args("input"))
    .flatMap('line -> 'word) { tokenize }
    .map('word -> 'word) { w: String => Map[String, Int](w -> 1) }
    .groupAll { _.sum[Map[String, Int]]('word) }
    // We could save the map here, but if we want similar output as in previous...
    .flatMap('word -> ('word, 'size)) { words: Map[String, Int] => words.toList }
    .groupAll { _.sortBy(('size, 'word)).reverse } // this is just for easy results
    .write(Tsv(args("output")))

  def tokenize(text: String): Array[String] = {
    text.toLowerCase.replaceAll("[^a-z0-9\\s]", "").split("\\s+").filter(_ != "")
  }
}

Output:
the    1804
and    912
to     801
a      684
of     625
it     541
she    538
said   462
you    429
in     428
i      400
alice  385
was    358
that   291
as     272
her    248
with   228
at     224
on     204
all    197
21. Top Words with CMS

package com.sanoma.cda.examples
import com.twitter.scalding._
import com.twitter.algebird._

class WordCount5(args: Args) extends Job(args) {
  implicit def utf8(s: String): Array[Byte] = com.twitter.bijection.Injection.utf8(s)
  implicit val cmsm = new SketchMapMonoid[String, Long](128, 6, 0, 20) // top 20
  type ApproxMap = SketchMap[String, Long]

  TextLine(args("input"))
    .flatMap('line -> 'word) { tokenize }
    .map('word -> 'word) { w: String => cmsm.create((w, 1L)) }
    .groupAll { _.sum[ApproxMap]('word) }
    .flatMap('word -> ('word, 'size)) { words: ApproxMap => words.heavyHitters }
    .write(Tsv(args("output")))

  def tokenize(text: String): Array[String] = {
    text.toLowerCase.replaceAll("[^a-z0-9\\s]", "").split("\\s+").filter(_ != "")
  }
}

Output:
the    1859
and    972
to     867
a      748
of     711
she    636
it     619
said   579
you    504
in     495
i      456
alice  431
at     407
was    394
that   342
her    341
with   338
as     337
not    290
be     286
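The SketchMap used in WordCount5 is built on the count-min sketch, which is why the counts above are approximate (slight overestimates compared to the exact run). A toy version of the underlying idea — far simpler than Algebird's implementation, with an illustrative hash function of my own:

```scala
// Toy count-min sketch: fixed memory regardless of the number of distinct
// keys. Each key hashes into one counter per row; reads take the minimum
// over the rows, so estimates can only overcount (hash collisions), never
// undercount.
class ToyCountMin(width: Int, depth: Int) {
  private val table = Array.ofDim[Long](depth, width)

  // Illustrative per-row hash; a real implementation uses pairwise-
  // independent hash families.
  private def bucket(key: String, row: Int): Int =
    ((key.hashCode * 31 + row) % width + width) % width

  def add(key: String, count: Long = 1L): Unit =
    for (row <- 0 until depth) table(row)(bucket(key, row)) += count

  def estimate(key: String): Long =
    (0 until depth).map(row => table(row)(bucket(key, row))).min
}

object ToyCountMinDemo {
  def main(args: Array[String]): Unit = {
    val cms = new ToyCountMin(width = 128, depth = 6)
    Seq("the", "the", "the", "alice").foreach(cms.add(_))
    println(cms.estimate("the")) // at least 3 (can only overestimate)
  }
}
```

Because merging two sketches is just element-wise addition of the counter tables, sketches form a monoid — which is exactly what lets WordCount5 sum them with `.sum[ApproxMap]` across mappers and reducers.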