5. The Good: Hadoop!
• Linear scalability
• Schema on read and unstructured data
• Transparent parallelism
• Open source
We want the things we love about Hadoop to be available in Python, too!
7. The Good: Succinct code
Faster development
Less enterprise-y
Lower barrier to entry
8. The Good: Interpreted, not compiled
Change code in place
Simpler continuous integration
More portable across platforms
9. The Good: Python libraries for data
HYPE (Python & data science)
Tighter "integration" with data science
pandas, scikit-learn, nltk, numpy, scipy, gensim, matplotlib
10. The Bad: Python shortfalls
Less enterprise-y
Type safety
No JVM sandbox safety
Performance
11. Summary of Good & Bad
Love Python/Hadoop for Data Science and Analysis
Deal with Java/Hadoop for Data Engineering and production code
The benefit of the good and the cost of the bad is different for everyone!
12. The Ugly
Nothing in Hadoop is native to Python
Performance can be awful due to serialization and IPC in most cases
Python bindings are almost always behind, if they exist
Clone some random dude's code from GitHub and pray to Guido van Rossum that it compiles and/or works
14. mrjob
• Write MapReduce jobs in Python!
• Open sourced and maintained by Yelp
• Wraps "Hadoop Streaming" in CPython, Python 2.5+
• Well documented
• Can run locally, on Amazon EMR, or on a Hadoop cluster (run commands are sketched after the example below)
from mrjob.job import MRJob

class MRWordFreqCount(MRJob):
    def mapper(self, _, line):
        # emit (word, 1) for every word on the input line
        for word in line.split():
            yield (word.lower(), 1)

    def reducer(self, word, counts):
        # sum up the counts emitted for each word
        yield (word, sum(counts))

if __name__ == '__main__':
    MRWordFreqCount.run()
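The same job file switches between runners with a flag; a rough sketch, assuming the class above is saved as word_freq.py and the cluster/EMR credentials are already configured:
$ python word_freq.py input.txt                       # run locally
$ python word_freq.py -r hadoop hdfs:///data/input    # run on a Hadoop cluster
$ python word_freq.py -r emr s3://mybucket/input/     # run on Amazon EMR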
15. Pydoop
• Write MapReduce jobs in Python!
• Uses Hadoop C++ Pipes, which should be faster than wrapping Hadoop Streaming
• Actively being worked on
• I’m not sure which is better
def mapper(_, v, writer):
    # emit (word, 1) for every word in the input value
    for word in v.split():
        writer.emit(word.lower(), 1)

def reducer(word, icounts, writer):
    # sum the counts for each word
    writer.emit(word, sum(icounts))
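Functions with this mapper/reducer signature are meant for Pydoop's script runner; a hedged sketch, assuming they are saved as wordcount.py and the HDFS paths exist:
$ pydoop script wordcount.py /hdfs/input /hdfs/output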
17. Pig
Pig is a higher-level platform and language for analyzing data that happens to run MapReduce underneath
You can write Pig UDFs in Python. Let Pig control data flow and let Python deal with the data manipulation! (Registering the UDF file is sketched after the example below.)
Can use Jython (faster) or CPython (access to more libraries)
-- Pig Latin (relations a and j are assumed to be defined earlier in the script)
b = FOREACH a GENERATE revstr(phonenum);
m = GROUP j BY username;
n = FOREACH m GENERATE group, sortedconcat(j.tags);

# Python UDFs
@outputSchema("tags:chararray")
def sortedconcat(bag):
    out = set()
    for tag in bag:
        out.add(tag)
    return '-'.join(sorted(out))

@outputSchema("rev:chararray")
def revstr(instr):
    return instr[::-1]
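Before Pig can call these UDFs, the Python file has to be registered in the Pig script and the calls namespaced with the alias; a minimal sketch, assuming the functions above are saved as udfs.py (with CPython, streaming_python is used in place of jython):
REGISTER 'udfs.py' USING jython AS udfs;
b = FOREACH a GENERATE udfs.revstr(phonenum);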
18. snakebite
• A pure Python client that interacts with HDFS
• Handles most NameNode ops (moving/renaming files, deleting)
• Handles most DataNode reading ops (reading files, getmerge)
• Two ways to use: library and command line interface
from snakebite.client import Client

# connect to the NameNode RPC port; use_trash=False bypasses the HDFS trash on delete
client = Client("1.2.3.4", 54310, use_trash=False)

for x in client.ls(['/data']):
    print x

print ''.join(client.cat('/data/ref/refdata*.csv'))

$ snakebite get /path/in/hdfs/mydata.txt /local/path/data.txt
19. Starbase or Happybase
Uses the HBase Thrift gateway interface (slow)
Last commit 6 months ago
Not really there yet; they have failed to gain community momentum. Java is still king. (A minimal happybase sketch follows.)
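For reference, a minimal happybase sketch against the HBase Thrift gateway; the host, table, and column names here are made up:
import happybase

# connect through the Thrift gateway (hostname is illustrative)
connection = happybase.Connection('hbase-thrift-host')
table = connection.table('users')

# write one cell, then read the whole row back
table.put('row-key-1', {'info:username': 'donald'})
print table.row('row-key-1')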
20. PySpark
• Programming in Spark (and PySpark) is in the form of chaining transformations and actions on RDDs
• RDDs are “Resilient Distributed Datasets”
• RDDs are kept in memory for the most part
21. PySpark Word Count Example
import sys
from operator import add
from pyspark import SparkContext

if __name__ == "__main__":
    if len(sys.argv) != 2:
        print >> sys.stderr, "Usage: wordcount <file>"
        exit(-1)

    sc = SparkContext(appName="PythonWordCount")
    lines = sc.textFile(sys.argv[1], 1)

    # transformations: split into words, pair each with 1, sum per word
    counts = lines.flatMap(lambda x: x.split(' ')) \
                  .map(lambda x: (x, 1)) \
                  .reduceByKey(add)

    # action: pull the counts back to the driver and print them
    output = counts.collect()
    for (word, count) in output:
        print "%s: %i" % (word, count)

    sc.stop()
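A PySpark script like the one above is normally handed to spark-submit to run; a sketch, assuming it is saved as wordcount.py:
$ spark-submit --master local[2] wordcount.py input.txt
(on a cluster, the master would point at YARN or a Spark standalone master instead)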
Spark's native language is Scala, but it also supports Java and Python
The Python API is always a tad behind Scala
25. What’s wrong with this?
Python bindings are almost always fringe projects
Other parts of the Hadoop ecosystem are getting way more $
Lack of commercial support
Lack of cohesiveness in APIs and approaches
26. SCALABLE HADOOP WITH
SUCCINCT PYTHON:
THE BEST OF BOTH WORLDS
Donald Miner
@donaldpminer
dminer@minerkasch.com
Hadoop Summit 2015
Editor's Notes
Working with Hadoop using Python instead of Java is entirely possible with a conglomeration of active open source projects. This talk will survey the key projects and show that not only is Hadoop with Python possible, but that it also has some advantages. With Python data analysts can leverage the scale of Hadoop while also leveraging the best of the best data analysis libraries available, most notably numpy, pandas, nltk, and scikit-learn. The following frameworks and tools will be surveyed:
- Interacting with files in the Hadoop Distributed File System with snakebite
- Writing MapReduce jobs with mrjob
- Writing Pig Python UDFs
- Interacting with HBase with starbase
Joke about googling clip art, explaining the metaphor, and that it has nothing to do with this talk.
Talk about what Hadoop with Python literally means here.
It means using Python (instead of Java) to interact with Hadoop