Python in the Hadoop Ecosystem (Rock Health presentation)

1
A Guide to Python Frameworks for Hadoop
Uri Laserson
laserson@cloudera.com
20 March 2014

Goals for today
1. Easy to jump into Hadoop with Python
2. Describe 5 ways to use Python with Hadoop, batch
and interactive
3. Guidelines for choosing Python framework
2

3
Code:
https://github.com/laserson/rock-health-python
Blog post:
http://blog.cloudera.com/blog/2013/01/a-guide-to-
python-frameworks-for-hadoop/
Slides:
http://www.slideshare.net/urilaserson/

About the speaker
• Joined Cloudera late 2012
• Focus on life sciences/medical
• PhD in BME/computational biology at MIT/Harvard
(2005-2012)
• Focused on genomics
• Cofounded Good Start Genetics (2007-)
• Applying next-gen DNA sequencing to genetic carrier
screening
4

About the speaker
• No formal training in computer science
• Never touched Java
• Almost all work using Python
5

Python frameworks for Hadoop
• Hadoop Streaming
• mrjob (Yelp)
• dumbo
• Luigi (Spotify)
• hadoopy
• pydoop
• PySpark
• happy
• Disco
• octopy
• Mortar Data
• Pig UDF/Jython
• hipy
• Impala + Numba
7

Goals for Python framework
1. “Pseudocodiness”/simplicity
2. Flexibility/generality
3. Ease of use/installation
4. Performance
8

• mrjob (Yelp)
• dumbo
• Luigi (Spotify)
• hadoopy
• pydoop
• PySpark
• happy
• Disco
• octopy
• Mortar Data
• Pig UDF/Jython
• hipy
• Impala + Numba
9

• mrjob (Yelp)
• dumbo
• Luigi (Spotify)
• hadoopy
• pydoop
• PySpark
• happy abandoned? Jython-based
• Disco not Hadoop
• octopy not serious/not Hadoop
• Mortar Data HaaS; support numpy, scipy, nltk, pip-installable in UDF
• Pig UDF/Jython Pig is another talk; Jython limited
• hipy Python syntactic sugar to construct Hive queries
• Impala + Numba
10

11
An n-gram is a tuple of n words.
Problem: aggregating the Google n-gram data
http://books.google.com/ngrams

12
An n-gram is a tuple of n words.
Problem: aggregating the Google n-gram data
http://books.google.com/ngrams
1 2 3 4 5 6 7 8
( )
8-gram

13
"A partial differential equation is an equation that contains partial derivatives."

14
A partial differential equation is an equation that contains partial derivatives.
A 1
partial 2
differential 1
equation 2
is 1
an 1
that 1
contains 1
derivatives. 1
1-grams

15
A partial 1
partial differential 1
differential equation 1
equation is 1
is an 1
an equation 1
equation that 1
that contains 1
contains partial 1
partial derivatives. 1
2-grams

16
A partial differential equation is 1
partial differential equation is an 1
differential equation is an equation 1
equation is an equation that 1
is an equation that contains 1
an equation that contains partial 1
equation that contains partial derivatives. 1
5-grams

19
flourished in 1993 2 2 2
flourished in 2008 220 215 118
fluid of 1899 2 2 1
fluid of 2000 3 3 1
fluid of 2002 2 1 1
fluid of 2003 3 3 1
fluid of 2004 3 3 3
2-gram year matches pages volumes

20
Compute how often two words are near each
other in a given year.
Two words are “near” if they are both
present in a 2-, 3-, 4-, or 5-gram.

21
...2-grams...
(cat, the) 1999 14
(the, cat) 1999 7002
...3-grams...
(the, cheshire, cat) 1999 563
...4-grams...
...5-grams...
(the, cat, in, the, hat) 1999 1023
(the, dog, chased, the, cat) 1999 403
(cat, is, one, of, the) 1999 24
(cat, the) 1999 8006
(hat, the) 1999 1023
raw data
aggregated results
lexicographic
ordering
internal n-grams counted by smaller n-grams:
• avoids double-counting
• increases sensitivity (observed at least 40 times)

What is Hadoop?
• Ecosystem of tools
• Core is the HDFS file system
• Downloadable set of jars that can be run on any
machine
22

HDFS design assumptions
• Based on Google File System
• Files are large (GBs to TBs)
• Failures are common
• Massive scale means failures very likely
• Disk, node, or network failures
• Accesses are large and sequential
• Files are append-only
23

HDFS properties
• Fault-tolerant
• Gracefully responds to node/disk/network failures
• Horizontally scalable
• Low marginal cost
• High-bandwidth
24
1
2
3
4
5
2
4
5
1
2
5
1
3
4
2
3
5
1
3
4
Input File
HDFS storage distribution
Node A Node B Node C Node D Node E

MapReduce computation
• Structured as
1. Embarrassingly parallel “map stage”
2. Cluster-wide distributed sort (“shuffle”)
3. Aggregation “reduce stage”
• Data-locality: process the data where it is stored
• Fault-tolerance: failed tasks automatically detected
and restarted
• Schema-on-read: data must not be stored conforming
to rigid schema
26

Pseudocode for MapReduce
27
def map(record):
(ngram, year, count) = unpack(record)
// ensure word1 has the lexicographically first word:
(word1, word2) = sorted(ngram[first], ngram[last])
key = (word1, word2, year)
emit(key, count)
def reduce(key, values):
emit(key, sum(values))
All source code available on GitHub:

Native Java
28
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
public class NgramsDriver extends Configured implements Tool {
public int run(String[] args) throws Exception {
Job job = new Job(getConf());
job.setJarByClass(getClass());
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.setMapperClass(NgramsMapper.class);
job.setCombinerClass(NgramsReducer.class);
job.setReducerClass(NgramsReducer.class);
job.setOutputKeyClass(TextTriple.class);
job.setOutputValueClass(IntWritable.class);
job.setNumReduceTasks(10);
return job.waitForCompletion(true) ? 0 : 1;
}
public static void main(String[] args) throws Exception {
int exitCode = ToolRunner.run(new NgramsDriver(), args);
System.exit(exitCode);
}
}
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.log4j.Logger;
public class NgramsMapper extends Mapper<LongWritable, Text, TextTriple, IntWritable> {
private Logger LOG = Logger.getLogger(getClass());
private int expectedTokens;
@Override
protected void setup(Context context) throws IOException, InterruptedException {
String inputFile = ((FileSplit) context.getInputSplit()).getPath().getName();
LOG.info("inputFile: " + inputFile);
Pattern c = Pattern.compile("([d]+)gram");
Matcher m = c.matcher(inputFile);
m.find();
expectedTokens = Integer.parseInt(m.group(1));
return;
}
@Override
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
String[] data = value.toString().split("t");
if (data.length < 3) {
return;
}
String[] ngram = data[0].split("s+");
String year = data[1];
IntWritable count = new IntWritable(Integer.parseInt(data[2]));
if (ngram.length != this.expectedTokens) {
return;
}
// build keyOut
List<String> triple = new ArrayList<String>(3);
triple.add(ngram[0]);
triple.add(ngram[expectedTokens - 1]);
Collections.sort(triple);
triple.add(year);
TextTriple keyOut = new TextTriple(triple);
context.write(keyOut, count);
}
}
import org.apache.hadoop.mapreduce.Reducer;
public class NgramsReducer extends Reducer<TextTriple, IntWritable, TextTriple, IntWritable> {
@Override
protected void reduce(TextTriple key, Iterable<IntWritable> values, Context context)
throws IOException, InterruptedException {
int sum = 0;
for (IntWritable value : values) {
sum += value.get();
}
context.write(key, new IntWritable(sum));
}
}
import java.io.DataInput;
import java.io.DataOutput;
import java.util.List;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
public class TextTriple implements WritableComparable<TextTriple> {
private Text first;
private Text second;
private Text third;
public TextTriple() {
set(new Text(), new Text(), new Text());
}
public TextTriple(List<String> list) {
set(new Text(list.get(0)),
new Text(list.get(1)),
new Text(list.get(2)));
}
public void set(Text first, Text second, Text third) {
this.first = first;
this.second = second;
this.third = third;
}
public void write(DataOutput out) throws IOException {
first.write(out);
second.write(out);
third.write(out);
}
public void readFields(DataInput in) throws IOException {
first.readFields(in);
second.readFields(in);
third.readFields(in);
}
@Override
public int hashCode() {
return first.hashCode() * 163 + second.hashCode() * 31 + third.hashCode();
}
@Override
public boolean equals(Object obj) {
if (obj instanceof TextTriple) {
TextTriple tt = (TextTriple) obj;
return first.equals(tt.first) && second.equals(tt.second) && third.equals(tt.third);
}
return false;
}
@Override
public String toString() {
return first + "t" + second + "t" + third;
}
public int compareTo(TextTriple other) {
int comp = first.compareTo(other.first);
if (comp != 0) {
return comp;
}
comp = second.compareTo(other.second);
if (comp != 0) {
return comp;
}
return third.compareTo(other.third);
}
}

Native Java
• Maximum flexibility
• Fastest performance
• Native to Hadoop
• Most difficult to write
29

Hadoop Streaming
30
hadoop jar hadoop-streaming-*-.jar
-input path/to/input
-output path/to/output
-mapper “grep WARN”

Hadoop Streaming: features
• Canonical method for using any executable as
mapper/reducer
• Includes shell commands, like grep
• Transparent communication with Hadoop though
stdin/stdout
• Key boundaries manually detected in reducer
• Built-in with Hadoop: should require no additional
framework installation
• Developer must decide how to encode more
complicated objects (e.g., JSON) or binary data
31

mrjob
33
class NgramNeighbors(MRJob):
# specify input/intermed/output serialization
# default output protocol is JSON; here we set it to text
OUTPUT_PROTOCOL = RawProtocol
def mapper(self, key, line):
pass
def combiner(self, key, counts):
pass
def reducer(self, key, counts):
pass
if __name__ == '__main__':
# sets up a runner, based on command line options
NgramNeighbors.run()

mrjob: features
• Abstracted MapReduce interface
• Handles complex Python objects
• Multi-step MapReduce workflows
• Extremely tight AWS integration
• Easily choose to run locally, on Hadoop cluster, or on
EMR
• Actively developed; great documentation
34

mrjob: serialization
36
class MyMRJob(mrjob.job.MRJob):
INPUT_PROTOCOL = mrjob.protocol.RawValueProtocol
INTERNAL_PROTOCOL = mrjob.protocol.JSONProtocol
OUTPUT_PROTOCOL = mrjob.protocol.JSONProtocol
Defaults
RawProtocol / RawValueProtocol
JSONProtocol / JSONValueProtocol
PickleProtocol / PickleValueProtocol
ReprProtocol / ReprValueProtocol
Available
Custom protocols can be written.
No current support for binary serialization schemes.

luigi
• Full-fledged workflow management, task
scheduling, dependency resolution tool in Python
(similar to Apache Oozie)
• Built-in support for Hadoop by wrapping Streaming
• Not as fully-featured as mrjob for Hadoop, but easily
customizable
• Internal serialization through repr/eval
• Actively developed at Spotify
• README is good but documentation is lacking
37

The cluster used for benchmarking
• 5 virtual machines
• 4 CPUs
• 10 GB RAM
• 100 GB disk
• CentOS 6.2
• CDH4 (Hadoop 2)
• 20 map tasks
• 10 reduce tasks
• Python 2.6
39

(Unscientific) performance comparison
40

41
Streaming has
lowest overhead

42
JSON SerDe

What is Spark?
• Started in 2009 as academic project from Amplab at
UCBerkeley; now ASF and >100 contributors
• In-memory distributed execution engine
• Operates on Resilient Distributed Datasets (RDDs)
• Provides richer distributed computing primitives for
various problems
• Can support SQL, stream processing, ML, graph
computation
• Supports Scala, Java, and Python
48

Spark uses a general DAG scheduler
• Application aware scheduler
• Uses locality for both disk
and memory
• Partitioning-aware
to avoid shuffles
• Can rewrite and optimize
graph based on analysis
join
union
groupBy
map
Stage 3
Stage 1
Stage 2
A: B:
C: D:
E:
F:
G:
= cached data partition

Operations on RDDs
50
Zaharia 2011

Apache Spark
51
file = spark.textFile("hdfs://...")
errors = file.filter(lambda line: "ERROR” in line)
# Count all the errors
errors.count()
# Count errors mentioning MySQL
errors.filter(lambda line: "MySQL” in line).count()
# Fetch the MySQL errors as an array of strings
errors.filter(lambda line: "MySQL” in line).collect()
val points = spark.textFile(...).map(parsePoint).cache()
var w = Vector.random(D) // current separating plane
for (i <- 1 to ITERATIONS) {
val gradient = points.map(p =>
(1 / (1 + exp(-p.y*(w dot p.x))) - 1) * p.y * p.x
).reduce(_ + _)
w -= gradient
}
println("Final separating plane: " + w)
Logfiltering
(Python)
Logisticregression
(Scala)

What’s Impala?
• Interactive SQL
• Typically 4-65x faster than the latest Hive (observed up to 100x faster)
• Responses in seconds instead of minutes (sometimes sub-second)
• ANSI-92 standard SQL queries with HiveQL
• Compatible SQL interface for existing Hadoop/CDH applications
• Based on industry standard SQL
• Natively on Hadoop/HBase storage and metadata
• Flexibility, scale, and cost advantages of Hadoop
• No duplication/synchronization of data and metadata
• Local processing to avoid network bottlenecks
• Separate runtime from batch processing
• Hive, Pig, MapReduce are designed and great for batch
• Impala is purpose-built for low-latency SQL queries on Hadoop
Cloudera Confidential. ©2013
Cloudera, Inc. All Rights Reserved.
53

Cloudera Impala
54
SELECT cosmic as snp_id,
vcf_chrom as chr,
vcf_pos as pos,
sample_id as sample,
vcf_call_gt as genotype,
sample_affection as phenotype
FROM
hg19_parquet_snappy_join_cached_partitioned
WHERE
COSMIC IS NOT NULL AND
dbSNP IS NULL AND
sample_study = ”breast_cancer" AND
VCF_CHROM = "16";

Impala Architecture: Planner
• Example: query with join and aggregation
SELECT state, SUM(revenue)
FROM HdfsTbl h JOIN HbaseTbl b ON (...)
GROUP BY 1 ORDER BY 2 desc LIMIT 10
Hbase
Scan
Hash
Join
Hdfs
Scan
Exch
TopN
Agg
Exch
at coordinator at DataNodes at region servers
Agg
TopN
Agg
Hash
Join
Hdfs
Scan
Hbase
Scan
Cloudera Confidential. ©2013
Cloudera, Inc. All Rights Reserved.
55

Impala User-defined Functions (UDFs)
• Tuple => Scalar value
• Substring
• sin, cos, pow, …
• Machine-learning models
• Supports Hive UDFs (Java)
• Highly unpleasurable
• Impala (native) UDFs
• C++ interface designed for efficiency
• Similar to Postgres UDFs
• Runs any LLVM-compiled code
56

LLVM compiler infrastructure
57

LLVM: C++ example
58
bool StringEq(FunctionContext* context,
const StringVal& arg1,
const StringVal& arg2) {
if (arg1.is_null != arg2.is_null)
return false;
if (arg1.is_null)
return true;
if (arg1.len != arg2.len)
return false;
return (arg1.ptr == arg2.ptr) ||
memcmp(arg1.ptr, arg2.ptr, arg1.len) == 0;
}

LLVM: IR output
59
; ModuleID = '<stdin>'
target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128-n8:16:32:64-S128"
target triple = "x86_64-apple-macosx10.7.0"
%"class.impala_udf::FunctionContext" = type { %"class.impala::FunctionContextImpl"* }
%"class.impala::FunctionContextImpl" = type opaque
%"struct.impala_udf::StringVal" = type { %"struct.impala_udf::AnyVal", i32, i8* }
%"struct.impala_udf::AnyVal" = type { i8 }
; Function Attrs: nounwind readonly ssp uwtable
define zeroext i1 @_Z8StringEqPN10impala_udf15FunctionContextERKNS_9StringValES4_(%"class.impala_udf::FunctionContext"* nocapture %context, %"struct.impala_udf::StringVal"*
nocapture %arg1, %"struct.impala_udf::StringVal"* nocapture %arg2) #0 {
entry:
%is_null = getelementptr inbounds %"struct.impala_udf::StringVal"* %arg1, i64 0, i32 0, i32 0
%0 = load i8* %is_null, align 1, !tbaa !0, !range !3
%is_null1 = getelementptr inbounds %"struct.impala_udf::StringVal"* %arg2, i64 0, i32 0, i32 0
%1 = load i8* %is_null1, align 1, !tbaa !0, !range !3
%cmp = icmp eq i8 %0, %1
br i1 %cmp, label %if.end, label %return
if.end: ; preds = %entry
%tobool = icmp eq i8 %0, 0
br i1 %tobool, label %if.end7, label %return
if.end7: ; preds = %if.end
%len = getelementptr inbounds %"struct.impala_udf::StringVal"* %arg1, i64 0, i32 1
%2 = load i32* %len, align 4, !tbaa !4
%len8 = getelementptr inbounds %"struct.impala_udf::StringVal"* %arg2, i64 0, i32 1
%3 = load i32* %len8, align 4, !tbaa !4
%cmp9 = icmp eq i32 %2, %3
br i1 %cmp9, label %if.end11, label %return
if.end11: ; preds = %if.end7
%ptr = getelementptr inbounds %"struct.impala_udf::StringVal"* %arg1, i64 0, i32 2
%4 = load i8** %ptr, align 8, !tbaa !5
%ptr12 = getelementptr inbounds %"struct.impala_udf::StringVal"* %arg2, i64 0, i32 2
%5 = load i8** %ptr12, align 8, !tbaa !5
%cmp13 = icmp eq i8* %4, %5
br i1 %cmp13, label %return, label %lor.rhs
lor.rhs: ; preds = %if.end11
%conv17 = sext i32 %2 to i64
%call = tail call i32 @memcmp(i8* %4, i8* %5, i64 %conv17)
%cmp18 = icmp eq i32 %call, 0
br label %return

LLVM compiler infrastructure
60
NumbaPython

Iris data and BigML
61
def predict_species_orig(sepal_width=None,
petal_length=None,
petal_width=None):
""" Predictor for species from model/52952081035d07727e01d836
Predictive model by BigML - Machine Learning Made Easy
"""
if (petal_width is None):
return u'Iris-virginica'
if (petal_width > 0.8):
if (petal_width <= 1.75):
if (petal_length is None):
return u'Iris-versicolor'
if (petal_length > 4.95):
if (petal_length <= 5.45):
if (petal_length is None):
if (sepal_width is None):
if (sepal_width <= 3.1):
if (sepal_width > 3.1):
return u'Iris-setosa'

Impala + Numba
• Still pre-alpha
• Significantly faster execution thanks to native LLVM
• Significantly easier to write UDFs
63

65
If you have access to a Hadoop cluster and you want a
one-off quick-and-dirty job…
Hadoop Streaming

66
If you want an expressive Pythonic interface to build
complex, regular ETL workflows…
Luigi

67
If you want to integrate Hadoop with other regular
processes…
Luigi

68
If you don’t have access to Hadoop and want to try
stuff out…
mrjob

69
If you’re heavily using AWS…
mrjob

70
If you want to work interactively…
PySpark

71
If you want to do in-memory analytics…
PySpark

72
If you want to do anything…*
PySpark

73
If you want ease of Python with high performance
Impala + Numba

74
If you want to write Python UDFs for SQL queries…
Impala + Numba

75
Code:
Blog post:
http://blog.cloudera.com/blog/2013/01/a-guide-to-
python-frameworks-for-hadoop/
Slides:
http://www.slideshare.net/urilaserson/

Python in the Hadoop Ecosystem (Rock Health presentation)

Recommended

Recommended

More Related Content

What's hot

What's hot (19)

Viewers also liked

Viewers also liked (6)

Similar to Python in the Hadoop Ecosystem (Rock Health presentation)

Similar to Python in the Hadoop Ecosystem (Rock Health presentation) (20)

More from Uri Laserson

More from Uri Laserson (6)

Recently uploaded

Recently uploaded (20)

Python in the Hadoop Ecosystem (Rock Health presentation)

Editor's Notes