Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/17fsvKl.
Uri Laserson reviews the different available Python frameworks for Hadoop, including a comparison of performance, ease of use/installation, differences in implementation, and other features. Filmed at qconnewyork.com.
Uri Laserson is a data scientist at Cloudera. Previously, he received his PhD from MIT developing applications of high-throughput DNA sequencing to immunology. During that time, he co-founded Good Start Genetics, a next-generation diagnostics company focused on genetic carrier screening. In 2012 he was selected to Forbes's list of 30 under 30.
A Guide to Python Frameworks for Hadoop
1. A Guide to Python Frameworks for Hadoop
Uri Laserson | Data Scientist
laserson@cloudera.com
14 June 2013
2. InfoQ.com: News & Community Site
• 750,000 unique visitors/month
• Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese)
• Post content from our QCon conferences
• News 15-20 / week
• Articles 3-4 / week
• Presentations (videos) 12-15 / week
• Interviews 2-3 / week
• Books 1 / month
Watch the video with slide synchronization on InfoQ.com!
http://www.infoq.com/presentations/python-hadoop
3. Presented at QCon New York
www.qconnewyork.com
Purpose of QCon
- to empower software development by facilitating the spread of knowledge and innovation
Strategy
- practitioner-driven conference designed for YOU: influencers of change and innovation in your teams
- speakers and topics driving the evolution and innovation
- connecting and catalyzing the influencers and innovators
Highlights
- attended by more than 12,000 delegates since 2007
- held in 9 cities worldwide
4. About the speaker
• Joined Cloudera late 2012
• Focused on life sciences/medical
• PhD in BME/computational biology at MIT/Harvard (2005-2012)
• Focused on genomics
• Cofounded Good Start Genetics (2007-)
• Applying next-gen DNA sequencing to genetic carrier screening
5. About the speaker
• No formal training in computer science
• Never touched Java
• Almost all work using Python
7. Python frameworks for Hadoop
• Hadoop Streaming
• mrjob (Yelp)
• dumbo
• Luigi (Spotify)
• hadoopy
• pydoop
• PySpark
• happy
• Disco
• octopy
• Mortar Data
• Pig UDF/Jython
• hipy
8. Goals for Python framework
1. “Pseudocodiness”/simplicity
2. Flexibility/generality
3. Ease of use/installation
4. Performance
9. Problem: aggregating the Google n-gram data
An n-gram is a tuple of n words.
http://books.google.com/ngrams
10. Problem: aggregating the Google n-gram data
An n-gram is a tuple of n words.
http://books.google.com/ngrams
[Figure: eight consecutive words bracketed and numbered 1-8 to illustrate an 8-gram]
11. "A partial differential equation is an equation that contains partial derivatives."
12. A partial differential equation is an equation that contains partial derivatives.
1-grams:
A 1
partial 2
differential 1
equation 2
is 1
an 1
that 1
contains 1
derivatives. 1
13. A partial differential equation is an equation that contains partial derivatives.
2-grams:
A partial 1
partial differential 1
differential equation 1
equation is 1
is an 1
an equation 1
equation that 1
that contains 1
contains partial 1
partial derivatives. 1
14. A partial differential equation is an equation that contains partial derivatives.
5-grams:
A partial differential equation is 1
partial differential equation is an 1
differential equation is an equation 1
equation is an equation that 1
is an equation that contains 1
an equation that contains partial 1
equation that contains partial derivatives. 1
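The 1-gram, 2-gram, and 5-gram examples above can be reproduced with a short, self-contained sketch (the `ngrams` function name is mine, not from the talk):

```python
def ngrams(text, n):
    """Return the list of n-grams (tuples of n consecutive words) in text."""
    words = text.split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

sentence = ("A partial differential equation is an equation "
            "that contains partial derivatives.")

# The sentence has 11 words, so it yields 10 2-grams and 7 5-grams,
# matching the slides above.
two_grams = ngrams(sentence, 2)
five_grams = ngrams(sentence, 5)
```

A sentence of w words always yields w - n + 1 n-grams, which is why the counts shrink as n grows.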
25. Hadoop Streaming: features
• Canonical method for using any executable as mapper/reducer
• Includes shell commands, like grep
• Transparent communication with Hadoop through stdin/stdout
• Key boundaries manually detected in reducer
• Built-in with Hadoop: should require no additional framework installation
• Developer must decide how to encode more complicated objects (e.g., JSON) or binary data
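The "key boundaries manually detected in reducer" point is the part that trips people up, so here is a minimal word-count sketch of the Streaming contract (function names and the tab-separated `key\tcount` record format are my assumptions, not code from the talk):

```python
from itertools import groupby

def mapper(lines):
    """Streaming map phase: emit one 'word<TAB>1' record per word."""
    for line in lines:
        for word in line.split():
            yield '%s\t1' % word

def reducer(lines):
    """Streaming reduce phase: Hadoop sorts the map output by key, so all
    records for one key arrive adjacent on stdin; a record whose key differs
    from the previous one marks a key boundary. groupby makes that manual
    boundary detection explicit."""
    parsed = (line.rstrip('\n').split('\t', 1) for line in lines)
    for key, group in groupby(parsed, key=lambda kv: kv[0]):
        yield '%s\t%d' % (key, sum(int(v) for _, v in group))

# On a cluster, each function would live in its own script reading sys.stdin
# and would be launched with something like:
#   hadoop jar hadoop-streaming.jar -mapper mapper.py -reducer reducer.py ...
```

Because the interface is just sorted lines on stdin/stdout, the same two functions can be tested locally by piping through `sort`, which is one reason Streaming has the lowest overhead of the options in this talk.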
26. mrjob

from mrjob.job import MRJob
from mrjob.protocol import RawProtocol

class NgramNeighbors(MRJob):
    # specify input/intermed/output serialization
    # default output protocol is JSON; here we set it to text
    OUTPUT_PROTOCOL = RawProtocol

    def mapper(self, key, line):
        pass

    def combiner(self, key, counts):
        pass

    def reducer(self, key, counts):
        pass

if __name__ == '__main__':
    # sets up a runner, based on command line options
    NgramNeighbors.run()
29. mrjob: features
• Abstracted MapReduce interface
• Handles complex Python objects
• Multi-step MapReduce workflows
• Extremely tight AWS integration
• Easily choose to run locally, on Hadoop cluster, or on EMR
• Actively developed; great documentation
30. mrjob: serialization

Defaults:

class MyMRJob(mrjob.job.MRJob):
    INPUT_PROTOCOL = mrjob.protocol.RawValueProtocol
    INTERNAL_PROTOCOL = mrjob.protocol.JSONProtocol
    OUTPUT_PROTOCOL = mrjob.protocol.JSONProtocol

Available:
RawProtocol / RawValueProtocol
JSONProtocol / JSONValueProtocol
PickleProtocol / PickleValueProtocol
ReprProtocol / ReprValueProtocol

Custom protocols can be written. No current support for binary serialization schemes.
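To make the protocol idea concrete: a JSON-style protocol writes each record as a JSON-encoded key and value separated by a tab, which is why complex Python objects survive between MapReduce steps. This is a rough stdlib-only sketch of that wire format (my approximation, not mrjob's actual implementation):

```python
import json

def json_protocol_write(key, value):
    """Approximate a JSONProtocol-style line: json(key) TAB json(value)."""
    return '%s\t%s' % (json.dumps(key), json.dumps(value))

def json_protocol_read(line):
    """Invert json_protocol_write; split on the first tab only."""
    raw_key, raw_value = line.split('\t', 1)
    return json.loads(raw_key), json.loads(raw_value)

# Lists, dicts, and nested structures round-trip intact, unlike raw text.
line = json_protocol_write(['partial', 'differential'], {'count': 2})
key, value = json_protocol_read(line)
```

The RawProtocol variants skip the JSON step entirely (plain bytes), and the pickle/repr variants swap in a different encoder, but the line-oriented key-tab-value shape stays the same.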
31. dumbo
• Similar in spirit to mrjob
  • abstracted
  • complex objects
  • various runners
  • composable jobs
• Sporadically developed?
• Documentation is a series of blog posts
32. dumbo: serialization
• Typed bytes added to Hadoop, allowing binary data
• ctypedbytes
  • binary serialization
  • packs Python objects in C structs
• Much faster and more efficient than JSON or pickle
• Natively read SequenceFile
• Execute code from any Python egg or JAR
• Point to any Java InputFormat
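The typed bytes format mentioned above is a simple tagged binary encoding: a one-byte type code followed by the big-endian payload. As a rough illustration only (the type codes 3 for a 32-bit int and 7 for a length-prefixed UTF-8 string are taken from my reading of the Hadoop typedbytes spec; verify against the spec before relying on them):

```python
import struct

def tb_int(n):
    """Sketch of a typed-bytes int: 1-byte type code, then a
    big-endian 32-bit integer (5 bytes total)."""
    return struct.pack('>bi', 3, n)

def tb_string(s):
    """Sketch of a typed-bytes string: 1-byte type code, 4-byte
    big-endian length, then the raw UTF-8 bytes."""
    data = s.encode('utf-8')
    return struct.pack('>bi', 7, len(data)) + data
```

Compare a 5-byte integer here with the same value as JSON text plus quoting and framing; the fixed-width binary layout is where the "much faster and more efficient" claim comes from.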
33. dumbo: installation notes
• Required manual install on each node
• dumbo and typedbytes had to be installed as Python eggs
• Had trouble running a combiner due to MemoryErrors
34. hadoopy
• Similar to dumbo, with better docs
• Typedbytes serialization
• Experimental HBase integration
• Allows launching Python jobs even on nodes that do not have Python
• No command-line utility: must launch MR jobs within a Python program
35. pydoop
• Wraps Hadoop Pipes (C++ API) instead of Streaming
• HDFS commands communicate through libhdfs rather than shell
• Ability to implement a Python Partitioner, RecordReader, and RecordWriter
• All input/output must be strings
• Could not install it
36. luigi
• Full-fledged workflow management, task scheduling, dependency resolution tool in Python (similar to Apache Oozie)
• Built-in support for Hadoop by wrapping Streaming
• Not as fully-featured as mrjob for Hadoop, but easily customizable
• Internal serialization through repr/eval
• Actively developed at Spotify
• README is good but documentation is lacking
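Luigi's core model is that each task declares its upstream dependencies in a `requires()` method and its work in a `run()` method, and the scheduler runs dependencies first, skipping anything already complete. This is a tiny pure-Python sketch of that dependency-resolution idea, not the actual luigi API (all class and function names here are mine):

```python
class Task:
    """Tiny stand-in for a luigi-style task: declare upstream
    dependencies in requires(), do the work in run()."""
    def requires(self):
        return []
    def run(self, log):
        log.append(type(self).__name__)

def build(task, done=None, log=None):
    """Depth-first resolution: run every dependency before the task
    itself, skipping tasks that have already completed (the job the
    luigi scheduler performs, minus targets, workers, and retries)."""
    done = set() if done is None else done
    log = [] if log is None else log
    if type(task).__name__ in done:
        return log
    for dep in task.requires():
        build(dep, done, log)
    task.run(log)
    done.add(type(task).__name__)
    return log

# A three-stage pipeline: Load depends on Transform depends on Extract.
class Extract(Task):
    pass

class Transform(Task):
    def requires(self):
        return [Extract()]

class Load(Task):
    def requires(self):
        return [Transform()]
```

Calling `build(Load())` runs Extract, then Transform, then Load. Because the graph is general, the same machinery drives non-Hadoop work too, which is the "much more general than purely Hadoop" point in the conclusions.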
56. Conclusions
• Prefer Hadoop Streaming if possible
  • It’s easy enough
  • Lowest overhead
• Prefer mrjob for higher abstraction
  • Actively developed/great documentation
  • Feature-rich (incl. composable jobs)
  • Integration with AWS
• Prefer luigi for more complicated job flows
  • Actively developed
  • Much more general than purely Hadoop