5. The Good: Hadoop!
• Linear scalability
• Schema on read and unstructured data
• Transparent parallelism
• Open source
We want the things we love about Hadoop to be available in Python, too!
7. The Good: Succinct code
Faster development
Less enterprise-y
Lower barrier to entry
8. The Good: Interpreted, not compiled
Change code in place
Simpler continuous integration
More portable across platforms
9. The Good: Python libraries for data
HYPE (Python & data science)
Tighter "integration" with data science
pandas, scikit-learn, nltk, numpy, scipy, gensim, matplotlib
10. The Bad: Python shortfalls
Less enterprise-y
Type safety
No JVM sandbox safety
Performance
11. Summary of Good & Bad
Love Python/Hadoop for Data Science and Analysis
Deal with Java/Hadoop for Data Engineering and production code
The benefit of the good and the cost of the bad is different for everyone!
12. The Ugly
Nothing in Hadoop is native to Python
Performance can be awful due to serialization and IPC in most cases
Python bindings are almost always behind, if they exist
Clone some random dude's code from GitHub and pray to Guido van Rossum that it compiles and/or works
14. mrjob
• Write MapReduce jobs in Python!
• Open sourced and maintained by Yelp
• Wraps "Hadoop Streaming" in CPython, Python 2.5+
• Well documented
• Can run locally, on Amazon EMR, or on a Hadoop cluster (run commands are sketched after the example below)
from mrjob.job import MRJob

class MRWordFreqCount(MRJob):
    def mapper(self, _, line):
        # emit (word, 1) for every word on the input line
        for word in line.split():
            yield (word.lower(), 1)

    def reducer(self, word, counts):
        # sum up the counts emitted for each word
        yield (word, sum(counts))

if __name__ == '__main__':
    MRWordFreqCount.run()
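The same job file switches between runners with a flag; a rough sketch, assuming the class above is saved as word_freq.py and the cluster/EMR credentials are already configured:
$ python word_freq.py input.txt                       # run locally
$ python word_freq.py -r hadoop hdfs:///data/input    # run on a Hadoop cluster
$ python word_freq.py -r emr s3://mybucket/input/     # run on Amazon EMR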
15. Pydoop
• Write MapReduce jobs in Python!
• Uses Hadoop C++ Pipes, which should be faster than wrapping Hadoop Streaming
• Actively being worked on
• I’m not sure which is better
def mapper(_, v, writer):
    # emit (word, 1) for every word in the input value
    for word in v.split():
        writer.emit(word.lower(), 1)

def reducer(word, icounts, writer):
    # sum the counts for each word
    writer.emit(word, sum(icounts))
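Functions with this mapper/reducer signature are meant for Pydoop's script runner; a hedged sketch, assuming they are saved as wordcount.py and the HDFS paths exist:
$ pydoop script wordcount.py /hdfs/input /hdfs/output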
17. Pig
Pig is a higher-level platform and language for analyzing data that happens to run MapReduce underneath
You can write Pig UDFs in Python. Let Pig control data flow and let Python deal with the data manipulation! (Registering the UDF file is sketched after the example below.)
Can use Jython (faster) or CPython (access to more libraries)
-- Pig Latin (relations a and j are assumed to be defined earlier in the script)
b = FOREACH a GENERATE revstr(phonenum);
m = GROUP j BY username;
n = FOREACH m GENERATE group, sortedconcat(j.tags);

# Python UDFs
@outputSchema("tags:chararray")
def sortedconcat(bag):
    out = set()
    for tag in bag:
        out.add(tag)
    return '-'.join(sorted(out))

@outputSchema("rev:chararray")
def revstr(instr):
    return instr[::-1]
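Before Pig can call these UDFs, the Python file has to be registered in the Pig script and the calls namespaced with the alias; a minimal sketch, assuming the functions above are saved as udfs.py (with CPython, streaming_python is used in place of jython):
REGISTER 'udfs.py' USING jython AS udfs;
b = FOREACH a GENERATE udfs.revstr(phonenum);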
18. snakebite
• A pure Python client that interacts with HDFS
• Handles most NameNode ops (moving/renaming files, deleting)
• Handles most DataNode reading ops (reading files, getmerge)
• Two ways to use: library and command line interface
from snakebite.client import Client

# connect to the NameNode RPC port; use_trash=False bypasses the HDFS trash on delete
client = Client("1.2.3.4", 54310, use_trash=False)

for x in client.ls(['/data']):
    print x

print ''.join(client.cat('/data/ref/refdata*.csv'))

$ snakebite get /path/in/hdfs/mydata.txt /local/path/data.txt
19. Starbase or Happybase
Uses the HBase Thrift gateway interface (slow)
Last commit 6 months ago
Not really there yet; they have failed to gain community momentum. Java is still king. (A minimal happybase sketch follows.)
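For reference, a minimal happybase sketch against the HBase Thrift gateway; the host, table, and column names here are made up:
import happybase

# connect through the Thrift gateway (hostname is illustrative)
connection = happybase.Connection('hbase-thrift-host')
table = connection.table('users')

# write one cell, then read the whole row back
table.put('row-key-1', {'info:username': 'donald'})
print table.row('row-key-1')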
20. PySpark
• Programming in Spark (and PySpark) is in the form of chaining transformations and actions on RDDs
• RDDs are “Resilient Distributed Datasets”
• RDDs are kept in memory for the most part
21. PySpark Word Count Example
import sys
from operator import add
from pyspark import SparkContext

if __name__ == "__main__":
    if len(sys.argv) != 2:
        print >> sys.stderr, "Usage: wordcount <file>"
        exit(-1)

    sc = SparkContext(appName="PythonWordCount")
    lines = sc.textFile(sys.argv[1], 1)

    # transformations: split into words, pair each with 1, sum per word
    counts = lines.flatMap(lambda x: x.split(' ')) \
                  .map(lambda x: (x, 1)) \
                  .reduceByKey(add)

    # action: pull the counts back to the driver and print them
    output = counts.collect()
    for (word, count) in output:
        print "%s: %i" % (word, count)

    sc.stop()
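A PySpark script like the one above is normally handed to spark-submit to run; a sketch, assuming it is saved as wordcount.py:
$ spark-submit --master local[2] wordcount.py input.txt
(on a cluster, the master would point at YARN or a Spark standalone master instead)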
Spark's native language is Scala, but it also supports Java and Python
The Python API is always a tad behind Scala
25. What’s wrong with this?
Python bindings are almost always fringe projects
Other parts of the Hadoop ecosystem are getting way more $
Lack of commercial support
Lack of cohesiveness in APIs and approaches
26. SCALABLE HADOOP WITH
SUCCINCT PYTHON:
THE BEST OF BOTH WORLDS
Donald Miner
@donaldpminer
dminer@minerkasch.com
Hadoop Summit 2015
Editor's Notes
Working with Hadoop using Python instead of Java is entirely possible with a conglomeration of active open source projects. This talk will survey the key projects and show that not only is Hadoop with Python possible, but that it also has some advantages. With Python data analysts can leverage the scale of Hadoop while also leveraging the best of the best data analysis libraries available, most notably numpy, pandas, nltk, and scikit-learn. The following frameworks and tools will be surveyed:
- Interacting with files in the Hadoop Distributed File System with snakebite
- Writing MapReduce jobs with mrjob
- Writing Pig Python UDFs
- Interacting with HBase with starbase
Joke about googling clip art, explaining the metaphor, and that it has nothing to do with this talk.
Talk about what Hadoop with Python literally means here.
It means using Python (instead of Java) to interact with Hadoop