Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/17fsvKl.
Uri Laserson reviews the different available Python frameworks for Hadoop, including a comparison of performance, ease of use/installation, differences in implementation, and other features. Filmed at qconnewyork.com.
Uri Laserson is a data scientist at Cloudera. Previously, he received his PhD from MIT developing applications of high-throughput DNA sequencing to immunology. During that time, he co-founded Good Start Genetics, a next-generation diagnostics company focused on genetic carrier screening. In 2012 he was selected to Forbes's list of 30 under 30.
A Guide to Python Frameworks for Hadoop
1. A Guide to Python Frameworks for Hadoop
Uri Laserson | Data Scientist
laserson@cloudera.com
14 June 2013
2. InfoQ.com: News & Community Site
• 750,000 unique visitors/month
• Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese)
• Post content from our QCon conferences
• News 15-20 / week
• Articles 3-4 / week
• Presentations (videos) 12-15 / week
• Interviews 2-3 / week
• Books 1 / month
Watch the video with slide synchronization on InfoQ.com!
http://www.infoq.com/presentations/python-hadoop
3. Presented at QCon New York
www.qconnewyork.com
Purpose of QCon
- to empower software development by facilitating the spread of knowledge and innovation
Strategy
- practitioner-driven conference designed for YOU: influencers of change and innovation in your teams
- speakers and topics driving the evolution and innovation
- connecting and catalyzing the influencers and innovators
Highlights
- attended by more than 12,000 delegates since 2007
- held in 9 cities worldwide
4. About the speaker
• Joined Cloudera late 2012
• Focused on life sciences/medical
• PhD in BME/computational biology at MIT/Harvard (2005-2012)
• Focused on genomics
• Cofounded Good Start Genetics (2007-)
• Applying next-gen DNA sequencing to genetic carrier screening
5. About the speaker
• No formal training in computer science
• Never touched Java
• Almost all work using Python
7. Python frameworks for Hadoop
• Hadoop Streaming
• mrjob (Yelp)
• dumbo
• Luigi (Spotify)
• hadoopy
• pydoop
• PySpark
• happy
• Disco
• octopy
• Mortar Data
• Pig UDF/Jython
• hipy
8. Goals for Python framework
1. “Pseudocodiness”/simplicity
2. Flexibility/generality
3. Ease of use/installation
4. Performance
9. Problem: aggregating the Google n-gram data
An n-gram is a tuple of n words.
http://books.google.com/ngrams
10. Problem: aggregating the Google n-gram data
An n-gram is a tuple of n words.
http://books.google.com/ngrams
[Figure: eight consecutive words bracketed and numbered 1-8 to illustrate an 8-gram]
11. "A partial differential equation is an equation that contains partial derivatives."
12. A partial differential equation is an equation that contains partial derivatives.
1-grams:
A 1
partial 2
differential 1
equation 2
is 1
an 1
that 1
contains 1
derivatives. 1
13. A partial differential equation is an equation that contains partial derivatives.
2-grams:
A partial 1
partial differential 1
differential equation 1
equation is 1
is an 1
an equation 1
equation that 1
that contains 1
contains partial 1
partial derivatives. 1
14. A partial differential equation is an equation that contains partial derivatives.
5-grams:
A partial differential equation is 1
partial differential equation is an 1
differential equation is an equation 1
equation is an equation that 1
is an equation that contains 1
an equation that contains partial 1
equation that contains partial derivatives. 1
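The 1-gram, 2-gram, and 5-gram examples above can be reproduced with a short, self-contained sketch (the `ngrams` function name is mine, not from the talk):

```python
def ngrams(text, n):
    """Return the list of n-grams (tuples of n consecutive words) in text."""
    words = text.split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

sentence = ("A partial differential equation is an equation "
            "that contains partial derivatives.")

# The sentence has 11 words, so it yields 10 2-grams and 7 5-grams,
# matching the slides above.
two_grams = ngrams(sentence, 2)
five_grams = ngrams(sentence, 5)
```

A sentence of w words always yields w - n + 1 n-grams, which is why the counts shrink as n grows.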
25. Hadoop Streaming: features
• Canonical method for using any executable as mapper/reducer
• Includes shell commands, like grep
• Transparent communication with Hadoop through stdin/stdout
• Key boundaries manually detected in reducer
• Built-in with Hadoop: should require no additional framework installation
• Developer must decide how to encode more complicated objects (e.g., JSON) or binary data
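The "key boundaries manually detected in reducer" point is the part that trips people up, so here is a minimal word-count sketch of the Streaming contract (function names and the tab-separated `key\tcount` record format are my assumptions, not code from the talk):

```python
from itertools import groupby

def mapper(lines):
    """Streaming map phase: emit one 'word<TAB>1' record per word."""
    for line in lines:
        for word in line.split():
            yield '%s\t1' % word

def reducer(lines):
    """Streaming reduce phase: Hadoop sorts the map output by key, so all
    records for one key arrive adjacent on stdin; a record whose key differs
    from the previous one marks a key boundary. groupby makes that manual
    boundary detection explicit."""
    parsed = (line.rstrip('\n').split('\t', 1) for line in lines)
    for key, group in groupby(parsed, key=lambda kv: kv[0]):
        yield '%s\t%d' % (key, sum(int(v) for _, v in group))

# On a cluster, each function would live in its own script reading sys.stdin
# and would be launched with something like:
#   hadoop jar hadoop-streaming.jar -mapper mapper.py -reducer reducer.py ...
```

Because the interface is just sorted lines on stdin/stdout, the same two functions can be tested locally by piping through `sort`, which is one reason Streaming has the lowest overhead of the options in this talk.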
26. mrjob

from mrjob.job import MRJob
from mrjob.protocol import RawProtocol

class NgramNeighbors(MRJob):
    # specify input/intermed/output serialization
    # default output protocol is JSON; here we set it to text
    OUTPUT_PROTOCOL = RawProtocol

    def mapper(self, key, line):
        pass

    def combiner(self, key, counts):
        pass

    def reducer(self, key, counts):
        pass

if __name__ == '__main__':
    # sets up a runner, based on command line options
    NgramNeighbors.run()
29. mrjob: features
• Abstracted MapReduce interface
• Handles complex Python objects
• Multi-step MapReduce workflows
• Extremely tight AWS integration
• Easily choose to run locally, on Hadoop cluster, or on EMR
• Actively developed; great documentation
30. mrjob: serialization

Defaults:

class MyMRJob(mrjob.job.MRJob):
    INPUT_PROTOCOL = mrjob.protocol.RawValueProtocol
    INTERNAL_PROTOCOL = mrjob.protocol.JSONProtocol
    OUTPUT_PROTOCOL = mrjob.protocol.JSONProtocol

Available:
RawProtocol / RawValueProtocol
JSONProtocol / JSONValueProtocol
PickleProtocol / PickleValueProtocol
ReprProtocol / ReprValueProtocol

Custom protocols can be written. No current support for binary serialization schemes.
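To make the protocol idea concrete: a JSON-style protocol writes each record as a JSON-encoded key and value separated by a tab, which is why complex Python objects survive between MapReduce steps. This is a rough stdlib-only sketch of that wire format (my approximation, not mrjob's actual implementation):

```python
import json

def json_protocol_write(key, value):
    """Approximate a JSONProtocol-style line: json(key) TAB json(value)."""
    return '%s\t%s' % (json.dumps(key), json.dumps(value))

def json_protocol_read(line):
    """Invert json_protocol_write; split on the first tab only."""
    raw_key, raw_value = line.split('\t', 1)
    return json.loads(raw_key), json.loads(raw_value)

# Lists, dicts, and nested structures round-trip intact, unlike raw text.
line = json_protocol_write(['partial', 'differential'], {'count': 2})
key, value = json_protocol_read(line)
```

The RawProtocol variants skip the JSON step entirely (plain bytes), and the pickle/repr variants swap in a different encoder, but the line-oriented key-tab-value shape stays the same.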
31. dumbo
• Similar in spirit to mrjob
  • abstracted
  • complex objects
  • various runners
  • composable jobs
• Sporadically developed?
• Documentation is a series of blog posts
32. dumbo: serialization
• Typed bytes added to Hadoop, allowing binary data
• ctypedbytes
  • binary serialization
  • packs Python objects in C structs
• Much faster and more efficient than JSON or pickle
• Natively read SequenceFile
• Execute code from any Python egg or JAR
• Point to any Java InputFormat
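The typed bytes format mentioned above is a simple tagged binary encoding: a one-byte type code followed by the big-endian payload. As a rough illustration only (the type codes 3 for a 32-bit int and 7 for a length-prefixed UTF-8 string are taken from my reading of the Hadoop typedbytes spec; verify against the spec before relying on them):

```python
import struct

def tb_int(n):
    """Sketch of a typed-bytes int: 1-byte type code, then a
    big-endian 32-bit integer (5 bytes total)."""
    return struct.pack('>bi', 3, n)

def tb_string(s):
    """Sketch of a typed-bytes string: 1-byte type code, 4-byte
    big-endian length, then the raw UTF-8 bytes."""
    data = s.encode('utf-8')
    return struct.pack('>bi', 7, len(data)) + data
```

Compare a 5-byte integer here with the same value as JSON text plus quoting and framing; the fixed-width binary layout is where the "much faster and more efficient" claim comes from.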
33. dumbo: installation notes
• Required manual install on each node
• dumbo and typedbytes had to be installed as Python eggs
• Had trouble running a combiner due to MemoryErrors
34. hadoopy
• Similar to dumbo, with better docs
• Typedbytes serialization
• Experimental HBase integration
• Allows launching Python jobs even on nodes that do not have Python
• No command-line utility: must launch MR jobs within a Python program
35. pydoop
• Wraps Hadoop Pipes (C++ API) instead of Streaming
• HDFS commands communicate through libhdfs rather than shell
• Ability to implement a Python Partitioner, RecordReader, and RecordWriter
• All input/output must be strings
• Could not install it
36. luigi
• Full-fledged workflow management, task scheduling, dependency resolution tool in Python (similar to Apache Oozie)
• Built-in support for Hadoop by wrapping Streaming
• Not as fully-featured as mrjob for Hadoop, but easily customizable
• Internal serialization through repr/eval
• Actively developed at Spotify
• README is good but documentation is lacking
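Luigi's core model is that each task declares its upstream dependencies in a `requires()` method and its work in a `run()` method, and the scheduler runs dependencies first, skipping anything already complete. This is a tiny pure-Python sketch of that dependency-resolution idea, not the actual luigi API (all class and function names here are mine):

```python
class Task:
    """Tiny stand-in for a luigi-style task: declare upstream
    dependencies in requires(), do the work in run()."""
    def requires(self):
        return []
    def run(self, log):
        log.append(type(self).__name__)

def build(task, done=None, log=None):
    """Depth-first resolution: run every dependency before the task
    itself, skipping tasks that have already completed (the job the
    luigi scheduler performs, minus targets, workers, and retries)."""
    done = set() if done is None else done
    log = [] if log is None else log
    if type(task).__name__ in done:
        return log
    for dep in task.requires():
        build(dep, done, log)
    task.run(log)
    done.add(type(task).__name__)
    return log

# A three-stage pipeline: Load depends on Transform depends on Extract.
class Extract(Task):
    pass

class Transform(Task):
    def requires(self):
        return [Extract()]

class Load(Task):
    def requires(self):
        return [Transform()]
```

Calling `build(Load())` runs Extract, then Transform, then Load. Because the graph is general, the same machinery drives non-Hadoop work too, which is the "much more general than purely Hadoop" point in the conclusions.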
56. Conclusions
• Prefer Hadoop Streaming if possible
  • It’s easy enough
  • Lowest overhead
• Prefer mrjob for higher abstraction
  • Actively developed/great documentation
  • Feature-rich (incl. composable jobs)
  • Integration with AWS
• Prefer luigi for more complicated job flows
  • Actively developed
  • Much more general than purely Hadoop