A Guide to Python Frameworks for Hadoop
Uri Laserson | Data Scientist
laserson@cloudera.com
14 June 2013
Watch the video with slide synchronization on InfoQ.com!
http://www.infoq.com/presentations/python-hadoop
Presented at QCon New York
www.qconnewyork.com
About the speaker
• Joined Cloudera late 2012
• Focused on life sciences/medical
• PhD in BME/computational biology at MIT/Harvard (2005-2012)
• Focused on genomics
• Cofounded Good Start Genetics (2007-)
• Applying next-gen DNA sequencing to genetic carrier screening
About the speaker
• No formal training in computer science
• Never touched Java
• Almost all work using Python
Python frameworks for Hadoop
• Hadoop Streaming
• mrjob (Yelp)
• dumbo
• Luigi (Spotify)
• hadoopy
• pydoop
• PySpark
• happy
• Disco
• octopy
• Mortar Data
• Pig UDF/Jython
• hipy
Goals for Python framework
1. “Pseudocodiness”/simplicity
2. Flexibility/generality
3. Ease of use/installation
4. Performance
An n-gram is a tuple of n words.
Problem: aggregating the Google n-gram data
http://books.google.com/ngrams
[Figure: eight words numbered 1 through 8, enclosed in parentheses and labeled "8-gram"]
"A partial differential equation is an equation that contains partial derivatives."

1-grams
A 1
partial 2
differential 1
equation 2
is 1
an 1
that 1
contains 1
derivatives. 1

2-grams
A partial 1
partial differential 1
differential equation 1
equation is 1
is an 1
an equation 1
equation that 1
that contains 1
contains partial 1
partial derivatives. 1

5-grams
A partial differential equation is 1
partial differential equation is an 1
differential equation is an equation 1
equation is an equation that 1
is an equation that contains 1
an equation that contains partial 1
equation that contains partial derivatives. 1
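The n-gram counts for the example sentence can be reproduced with a few lines of Python. This is a minimal sketch, not code from the talk: the helper name `ngrams` is hypothetical, and tokenization is assumed to be a plain whitespace split, which is why the final period stays attached to "derivatives.".

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


sentence = ("A partial differential equation is an equation "
            "that contains partial derivatives.")
tokens = sentence.split()  # whitespace tokenization; punctuation stays attached

for n in (1, 2, 5):
    counts = Counter(ngrams(tokens, n))
    print("%d-grams" % n)
    for gram, count in counts.items():
        print(" ".join(gram), count)
```

A sliding window over the token list is enough here because every n-gram is a run of adjacent words.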
2-gram          year  matches  pages  volumes
flourished in   1993  2        2      2
flourished in   1998  2        2      1
flourished in   1999  6        6      4
flourished in   2000  5        5      5
flourished in   2001  1        1      1
flourished in   2002  7        7      3
flourished in   2003  9        9      4
flourished in   2004  22       21     13
flourished in   2005  37       37     22
flourished in   2006  55       55     38
flourished in   2007  99       98     76
flourished in   2008  220      215    118
fluid of        1899  2        2      1
fluid of        2000  3        3      1
fluid of        2002  2        1      1
fluid of        2003  3        3      1
fluid of        2004  3        3      3
Compute how often two words are near each other in a given year.
Two words are “near” if they are both present in a 2-, 3-, 4-, or 5-gram.
raw data
...2-grams...
(cat, the) 1999 14
(the, cat) 1999 7002

...3-grams...
(the, cheshire, cat) 1999 563

...4-grams...

...5-grams...
(the, cat, in, the, hat) 1999 1023
(the, dog, chased, the, cat) 1999 403
(cat, is, one, of, the) 1999 24

aggregated results (lexicographic ordering)
(cat, the) 1999 8006
(hat, the) 1999 1023

internal n-grams counted by smaller n-grams:
• avoids double-counting
• increases sensitivity (observed at least 40 times)
Pseudocode for MapReduce
def map(record):
    (ngram, year, count) = unpack(record)
    # ensure word1 is the lexicographically first word:
    (word1, word2) = sorted([ngram[first], ngram[last]])
    key = (word1, word2, year)
    emit(key, count)


def reduce(key, values):
    emit(key, sum(values))

All source code available on GitHub:
https://github.com/cloudera/python-ngrams
Native Java
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;


public class NgramsDriver extends Configured implements Tool {

    public int run(String[] args) throws Exception {
        Job job = new Job(getConf());
        job.setJarByClass(getClass());

        // args[0] is the input path, args[1] the output path
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.setMapperClass(NgramsMapper.class);
        // the reducer only sums counts, so it doubles as the combiner
        job.setCombinerClass(NgramsReducer.class);
        job.setReducerClass(NgramsReducer.class);

        job.setOutputKeyClass(TextTriple.class);
        job.setOutputValueClass(IntWritable.class);

        job.setNumReduceTasks(10);

        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new NgramsDriver(), args);
        System.exit(exitCode);
    }
}

import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.log4j.Logger;


public class NgramsMapper extends Mapper<LongWritable, Text, TextTriple, IntWritable> {

    private Logger LOG = Logger.getLogger(getClass());

    private int expectedTokens;

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // infer n from the input file name, which contains e.g. "2gram"
        String inputFile = ((FileSplit) context.getInputSplit()).getPath().getName();
        LOG.info("inputFile: " + inputFile);
        Pattern c = Pattern.compile("([\\d]+)gram");
        Matcher m = c.matcher(inputFile);
        if (!m.find()) {
            throw new IOException("cannot determine n from file name: " + inputFile);
        }
        expectedTokens = Integer.parseInt(m.group(1));
    }

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // record layout: ngram <TAB> year <TAB> count ...
        String[] data = value.toString().split("\t");

        if (data.length < 3) {
            return;
        }

        String[] ngram = data[0].split("\\s+");
        String year = data[1];
        IntWritable count = new IntWritable(Integer.parseInt(data[2]));

        if (ngram.length != this.expectedTokens) {
            return;
        }

        // build keyOut: first and last word in sorted order, then the year
        List<String> triple = new ArrayList<String>(3);
        triple.add(ngram[0]);
        triple.add(ngram[expectedTokens - 1]);
        Collections.sort(triple);
        triple.add(year);
        TextTriple keyOut = new TextTriple(triple);

        context.write(keyOut, count);
    }
}

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Reducer;


public class NgramsReducer extends Reducer<TextTriple, IntWritable, TextTriple, IntWritable> {

    @Override
    protected void reduce(TextTriple key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // sum all counts emitted for this (word1, word2, year) key
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));
    }
}

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.List;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;


// Composite key holding (word1, word2, year)
public class TextTriple implements WritableComparable<TextTriple> {

    private Text first;
    private Text second;
    private Text third;

    public TextTriple() {
        set(new Text(), new Text(), new Text());
    }

    public TextTriple(List<String> list) {
        set(new Text(list.get(0)),
            new Text(list.get(1)),
            new Text(list.get(2)));
    }

    public void set(Text first, Text second, Text third) {
        this.first = first;
        this.second = second;
        this.third = third;
    }

    public void write(DataOutput out) throws IOException {
        first.write(out);
        second.write(out);
        third.write(out);
    }

    public void readFields(DataInput in) throws IOException {
        first.readFields(in);
        second.readFields(in);
        third.readFields(in);
    }

    @Override
    public int hashCode() {
        return first.hashCode() * 163 + second.hashCode() * 31 + third.hashCode();
    }

    @Override
    public boolean equals(Object obj) {
        if (obj instanceof TextTriple) {
            TextTriple tt = (TextTriple) obj;
            return first.equals(tt.first) && second.equals(tt.second) && third.equals(tt.third);
        }
        return false;
    }

    @Override
    public String toString() {
        return first + "\t" + second + "\t" + third;
    }

    // order by first, then second, then third field
    public int compareTo(TextTriple other) {
        int comp = first.compareTo(other.first);
        if (comp != 0) {
            return comp;
        }
        comp = second.compareTo(other.second);
        if (comp != 0) {
            return comp;
        }
        return third.compareTo(other.third);
    }
}
Native Java
• Maximum flexibility
• Fastest performance
• Native to Hadoop
• Most difficult to write
Python implementation strategies
• Hadoop Streaming
    • mrjob
    • dumbo
    • hadoopy
• Hadoop Pipes
    • pydoop
• Non-Hadoop
    • Disco
    • octopy
Hadoop Streaming: execution
hadoop jar hadoop-streaming-2.0.0-mr1-cdh4.1.2.jar \
    -input /ngrams \
    -output /output-streaming \
    -mapper mapper.py \
    -combiner reducer.py \
    -reducer reducer.py \
    -jobconf stream.num.map.output.key.fields=3 \
    -jobconf stream.num.reduce.output.key.fields=3 \
    -jobconf mapred.reduce.tasks=10 \
    -file mapper.py \
    -file reducer.py
Hadoop Streaming: code
[Code listing shown on the slide; not preserved in this transcript]
Hadoop Streaming: features
• Canonical method for using any executable as mapper/reducer
• Includes shell commands, like grep
• Transparent communication with Hadoop through stdin/stdout
• Key boundaries manually detected in reducer
• Built-in with Hadoop: should require no additional framework installation
• Developer must decide how to encode more complicated objects (e.g., JSON) or binary data
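To make the stdin/stdout contract and the manual key-boundary detection concrete, here is a minimal sketch of a Streaming-style mapper and reducer for the n-gram problem. It is not the code from the slide: the function names and the sample input (including the page and volume numbers) are made up for illustration, and a local sort stands in for Hadoop's shuffle. On a cluster, each function would live in its own script and read `sys.stdin` and write `sys.stdout`.

```python
import io
import sys
from itertools import groupby


def run_mapper(stdin, stdout):
    """Read raw n-gram lines, write 'word1<TAB>word2<TAB>year<TAB>count'."""
    for line in stdin:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip malformed lines
        words = fields[0].split()
        if not words:
            continue
        word1, word2 = sorted([words[0], words[-1]])
        stdout.write("\t".join([word1, word2, fields[1], fields[2]]) + "\n")


def run_reducer(stdin, stdout):
    """Sum counts per key; input lines must arrive sorted by key."""
    rows = (line.rstrip("\n").split("\t") for line in stdin)
    # groupby starts a new group whenever the key changes, which is the
    # key-boundary detection a Streaming reducer has to do by hand.
    for key, group in groupby(rows, key=lambda row: tuple(row[:3])):
        total = sum(int(row[3]) for row in group)
        stdout.write("\t".join(key) + "\t" + str(total) + "\n")


# Local dry run with invented sample records.
raw = io.StringIO(
    "the cat\t1999\t7002\t10\t5\n"
    "cat the\t1999\t14\t3\t2\n"
    "the hat\t1999\t9\t1\t1\n"
)
mapped = io.StringIO()
run_mapper(raw, mapped)
# Sorting the mapper output stands in for the shuffle phase.
shuffled = io.StringIO("".join(sorted(mapped.getvalue().splitlines(True))))
reduced = io.StringIO()
run_reducer(shuffled, reduced)
sys.stdout.write(reduced.getvalue())
```

Because the reducer only sums, the same logic can also serve as the combiner, as in the `-combiner reducer.py` option of the execution command.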
mrjob
from mrjob.job import MRJob
from mrjob.protocol import RawProtocol


class NgramNeighbors(MRJob):
    # specify input/intermed/output serialization
    # default output protocol is JSON; here we set it to text
    OUTPUT_PROTOCOL = RawProtocol

    def mapper(self, key, line):
        pass

    def combiner(self, key, counts):
        pass

    def reducer(self, key, counts):
        pass


if __name__ == '__main__':
    # sets up a runner, based on command line options
    NgramNeighbors.run()
mrjob: runner
./ngrams.py -r hadoop \
    --hadoop-bin /usr/bin/hadoop \
    --jobconf mapred.reduce.tasks=10 \
    -o hdfs:///output-mrjob \
    hdfs:///ngrams
mrjob: code
[Code listing shown on the slide; not preserved in this transcript]
mrjob: features
• Abstracted MapReduce interface
• Handles complex Python objects
• Multi-step MapReduce workflows
• Extremely tight AWS integration
• Easily choose to run locally, on Hadoop cluster, or on EMR
• Actively developed; great documentation
mrjob: serialization
Defaults:
import mrjob.job
import mrjob.protocol


class MyMRJob(mrjob.job.MRJob):
    INPUT_PROTOCOL = mrjob.protocol.RawValueProtocol
    INTERNAL_PROTOCOL = mrjob.protocol.JSONProtocol
    OUTPUT_PROTOCOL = mrjob.protocol.JSONProtocol

Available:
RawProtocol / RawValueProtocol
JSONProtocol / JSONValueProtocol
PickleProtocol / PickleValueProtocol
ReprProtocol / ReprValueProtocol

Custom protocols can be written.
No current support for binary serialization schemes.
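The practical difference between the JSON, pickle, and repr styles of serialization shows up when a key is round-tripped. The sketch below uses only the standard library and does not call mrjob; the helper names (`json_write`, `json_read`, `repr_write`, `repr_read`) are hypothetical and only mimic the idea of encoding a key and value as a tab-separated line.

```python
import ast
import json
import pickle

key, value = ("cat", "the", 1999), 8006


def json_write(key, value):
    return json.dumps(key) + "\t" + json.dumps(value)


def json_read(line):
    raw_key, raw_value = line.split("\t", 1)
    return json.loads(raw_key), json.loads(raw_value)


def repr_write(key, value):
    return repr(key) + "\t" + repr(value)


def repr_read(line):
    raw_key, raw_value = line.split("\t", 1)
    # literal_eval accepts only Python literals, unlike a bare eval
    return ast.literal_eval(raw_key), ast.literal_eval(raw_value)


json_roundtrip = json_read(json_write(key, value))
repr_roundtrip = repr_read(repr_write(key, value))
pickle_roundtrip = pickle.loads(pickle.dumps((key, value)))

print(json_roundtrip)    # JSON has no tuple type, so the key comes back as a list
print(repr_roundtrip)    # tuples survive a repr round trip
print(pickle_roundtrip)  # pickle preserves the original types
```

One consequence worth keeping in mind: with a JSON-style encoding, tuple keys arrive in the next step as lists, so code that needs hashable keys has to convert them back.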
  
dumbo
• Similar in spirit to mrjob
    • abstracted
    • complex objects
    • various runners
    • composable jobs
• Sporadically developed?
• Documentation is a series of blog posts
dumbo: serialization
• Typed bytes added to Hadoop allowing binary data
• ctypedbytes
    • binary serialization
    • packs Python objects in C structs
• Much faster and more efficient than JSON or pickle
• Natively read SequenceFile
• Execute code from any Python egg or JAR
• Point to any Java InputFormat
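The idea behind a typed binary encoding can be sketched with Python's `struct` module: each value is written as a one-byte type code followed by a fixed-layout payload. This is an illustration only and does not use the typedbytes or ctypedbytes packages; the type codes and the big-endian layout below are assumptions chosen for the example, and the function names are hypothetical.

```python
import struct

# Assumed layout for illustration: 1-byte type code, then a big-endian payload.
TYPE_INT = 3     # assumption: code for a 32-bit signed integer
TYPE_STRING = 7  # assumption: code for a length-prefixed UTF-8 string


def pack_int(n):
    return struct.pack(">bi", TYPE_INT, n)


def pack_string(s):
    data = s.encode("utf-8")
    return struct.pack(">bi", TYPE_STRING, len(data)) + data


def unpack_one(buf, offset):
    """Decode one value starting at offset; return (value, next_offset)."""
    (code,) = struct.unpack_from(">b", buf, offset)
    offset += 1
    if code == TYPE_INT:
        (n,) = struct.unpack_from(">i", buf, offset)
        return n, offset + 4
    if code == TYPE_STRING:
        (length,) = struct.unpack_from(">i", buf, offset)
        offset += 4
        return buf[offset:offset + length].decode("utf-8"), offset + length
    raise ValueError("unknown type code %d" % code)


def unpack_all(buf):
    values, offset = [], 0
    while offset < len(buf):
        value, offset = unpack_one(buf, offset)
        values.append(value)
    return values


record = pack_string("cat") + pack_string("the") + pack_int(1999) + pack_int(8006)
print(len(record), unpack_all(record))
```

Because every value carries its own type code and length, a reader can decode a stream of records without delimiters or text parsing, which is what makes binary data safe to pass through.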
dumbo: installation notes
• Required manual install on each node
• dumbo and typedbytes had to be installed as Python eggs
• Had trouble running a combiner due to MemoryErrors
hadoopy
• Similar to dumbo, with better docs
• Typedbytes serialization
• Experimental HBase integration
• Allows launching Python jobs even on nodes that do not have Python
• No command line utility: must launch MR jobs within a Python program
pydoop
• Wraps Hadoop Pipes (C++ API) instead of Streaming
• HDFS commands communicate through libhdfs rather than shell
• Ability to implement a Python Partitioner, RecordReader, and RecordWriter
• All input/output must be strings
• Could not install it
luigi
• Full-fledged workflow management, task scheduling, dependency resolution tool in Python (similar to Apache Oozie)
• Built-in support for Hadoop by wrapping Streaming
• Not as fully-featured as mrjob for Hadoop, but easily customizable
• Internal serialization through repr/eval
• Actively developed at Spotify
• README is good but documentation is lacking
luigi: runner
python ngrams.py Ngrams \
    --local-scheduler \
    --n-reduce-tasks 10 \
    --source /ngrams \
    --destination /output-luigi
luigi: code
[Code listing shown on the slide; not preserved in this transcript]
Python frameworks for Hadoop
• Hadoop Streaming ✓
• mrjob (Yelp) ✓
• dumbo ✓
• Luigi (Spotify) ✓
• hadoopy ✓
• pydoop ❌
• PySpark: not Hadoop
• happy: abandoned? Jython-based
• Disco: not Hadoop
• octopy: not serious/not Hadoop
• Mortar Data: HaaS; support numpy, scipy, nltk, pip-installable in UDF
• Pig UDF/Jython: Pig is another talk; Jython limited
• hipy: Python syntactic sugar to construct Hive queries
Commit activity
[Figure: commit activity graphs for mrjob, dumbo, luigi, and hadoopy]
The cluster
• 5 virtual machines
• 4 CPUs
• 10 GB RAM
• 100 GB disk
• CentOS 6.2
• CDH4 (Hadoop 2)
• 20 map tasks
• 10 reduce tasks
• Python 2.6
(Unscientific) performance comparison
[Figure: performance comparison chart; the plotted values are not preserved in this transcript]
Slide annotations:
• Streaming has lowest overhead
• JSON SerDe
• Combiner was not used
Feature comparison
[Table: feature comparison; the table contents are not preserved in this transcript]
Conclusions
• Prefer Hadoop Streaming if possible
    • It’s easy enough
    • Lowest overhead
• Prefer mrjob for higher abstraction
    • Actively developed/great documentation
    • Feature-rich (incl. composable jobs)
    • Integration with AWS
• Prefer luigi for more complicated job flows
    • Actively developed
    • Much more general than purely Hadoop
55
Watch the video with slide synchronization on
InfoQ.com!
http://www.infoq.com/presentations/python-
hadoop

Weitere ähnliche Inhalte

Andere mochten auch

Times Ten in-memory database when time counts - Laszlo Ludas
Times Ten in-memory database when time counts - Laszlo LudasTimes Ten in-memory database when time counts - Laszlo Ludas
Times Ten in-memory database when time counts - Laszlo LudasORACLE USER GROUP ESTONIA
 
The Craftsman Developer In An Agile World
The Craftsman Developer In An Agile WorldThe Craftsman Developer In An Agile World
The Craftsman Developer In An Agile WorldOpenAgile Romania
 
10 concepts the enterprise decision maker needs to understand about Hadoop
10 concepts the enterprise decision maker needs to understand about Hadoop10 concepts the enterprise decision maker needs to understand about Hadoop
10 concepts the enterprise decision maker needs to understand about HadoopDonald Miner
 
D3 Basic Tutorial
D3 Basic TutorialD3 Basic Tutorial
D3 Basic TutorialTao Jiang
 
Building end to end streaming application on Spark
Building end to end streaming application on SparkBuilding end to end streaming application on Spark
Building end to end streaming application on Sparkdatamantra
 
Improving Mobile Payments With Real time Spark
Improving Mobile Payments With Real time SparkImproving Mobile Payments With Real time Spark
Improving Mobile Payments With Real time Sparkdatamantra
 
ASP.NET Core: The best of the new bits
ASP.NET Core: The best of the new bitsASP.NET Core: The best of the new bits
ASP.NET Core: The best of the new bitsKen Cenerelli
 
An Introduction to MapReduce
An Introduction to MapReduceAn Introduction to MapReduce
An Introduction to MapReduceFrane Bandov
 
Python for Big Data Analytics
Python for Big Data AnalyticsPython for Big Data Analytics
Python for Big Data AnalyticsEdureka!
 
Streaming Processing with a Distributed Commit Log
Streaming Processing with a Distributed Commit LogStreaming Processing with a Distributed Commit Log
Streaming Processing with a Distributed Commit LogJoe Stein
 
Introduction to MapReduce | MapReduce Architecture | MapReduce Fundamentals
Introduction to MapReduce | MapReduce Architecture | MapReduce FundamentalsIntroduction to MapReduce | MapReduce Architecture | MapReduce Fundamentals
Introduction to MapReduce | MapReduce Architecture | MapReduce FundamentalsSkillspeed
 
Intro to Amazon S3
Intro to Amazon S3Intro to Amazon S3
Intro to Amazon S3Yu Lun Teo
 
Interactive workflow management using Azkaban
Interactive workflow management using AzkabanInteractive workflow management using Azkaban
Interactive workflow management using Azkabandatamantra
 
Introduction to Apache Flink
Introduction to Apache FlinkIntroduction to Apache Flink
Introduction to Apache Flinkdatamantra
 

Andere mochten auch (20)

Times Ten in-memory database when time counts - Laszlo Ludas
Times Ten in-memory database when time counts - Laszlo LudasTimes Ten in-memory database when time counts - Laszlo Ludas
Times Ten in-memory database when time counts - Laszlo Ludas
 
The Craftsman Developer In An Agile World
The Craftsman Developer In An Agile WorldThe Craftsman Developer In An Agile World
The Craftsman Developer In An Agile World
 
Learning d3
Learning d3Learning d3
Learning d3
 
10 concepts the enterprise decision maker needs to understand about Hadoop
10 concepts the enterprise decision maker needs to understand about Hadoop10 concepts the enterprise decision maker needs to understand about Hadoop
10 concepts the enterprise decision maker needs to understand about Hadoop
 
D3 Basic Tutorial
D3 Basic TutorialD3 Basic Tutorial
D3 Basic Tutorial
 
D3.js mindblow
D3.js mindblowD3.js mindblow
D3.js mindblow
 
Anatomy of file write in hadoop
Anatomy of file write in hadoopAnatomy of file write in hadoop
Anatomy of file write in hadoop
 
D3 data visualization
D3 data visualizationD3 data visualization
D3 data visualization
 
Building end to end streaming application on Spark
Building end to end streaming application on SparkBuilding end to end streaming application on Spark
Building end to end streaming application on Spark
 
Improving Mobile Payments With Real time Spark
Improving Mobile Payments With Real time SparkImproving Mobile Payments With Real time Spark
Improving Mobile Payments With Real time Spark
 
ASP.NET Core: The best of the new bits
ASP.NET Core: The best of the new bitsASP.NET Core: The best of the new bits
ASP.NET Core: The best of the new bits
 
An Introduction to MapReduce
An Introduction to MapReduceAn Introduction to MapReduce
An Introduction to MapReduce
 
Python for Big Data Analytics
Python for Big Data AnalyticsPython for Big Data Analytics
Python for Big Data Analytics
 
Streaming Processing with a Distributed Commit Log
Streaming Processing with a Distributed Commit LogStreaming Processing with a Distributed Commit Log
Streaming Processing with a Distributed Commit Log
 
.Net Core
.Net Core.Net Core
.Net Core
 
Introduction to MapReduce | MapReduce Architecture | MapReduce Fundamentals
Introduction to MapReduce | MapReduce Architecture | MapReduce FundamentalsIntroduction to MapReduce | MapReduce Architecture | MapReduce Fundamentals
Introduction to MapReduce | MapReduce Architecture | MapReduce Fundamentals
 
Intro to Amazon S3
Intro to Amazon S3Intro to Amazon S3
Intro to Amazon S3
 
Interactive workflow management using Azkaban
Interactive workflow management using AzkabanInteractive workflow management using Azkaban
Interactive workflow management using Azkaban
 
Introduction to Apache Flink
Introduction to Apache FlinkIntroduction to Apache Flink
Introduction to Apache Flink
 
Deep Dive on Amazon S3
Deep Dive on Amazon S3Deep Dive on Amazon S3
Deep Dive on Amazon S3
 

Mehr von C4Media

Streaming a Million Likes/Second: Real-Time Interactions on Live Video
Streaming a Million Likes/Second: Real-Time Interactions on Live VideoStreaming a Million Likes/Second: Real-Time Interactions on Live Video
Streaming a Million Likes/Second: Real-Time Interactions on Live VideoC4Media
 
Next Generation Client APIs in Envoy Mobile
Next Generation Client APIs in Envoy MobileNext Generation Client APIs in Envoy Mobile
Next Generation Client APIs in Envoy MobileC4Media
 
Software Teams and Teamwork Trends Report Q1 2020
Software Teams and Teamwork Trends Report Q1 2020Software Teams and Teamwork Trends Report Q1 2020
Software Teams and Teamwork Trends Report Q1 2020C4Media
 
Understand the Trade-offs Using Compilers for Java Applications
Understand the Trade-offs Using Compilers for Java ApplicationsUnderstand the Trade-offs Using Compilers for Java Applications
Understand the Trade-offs Using Compilers for Java ApplicationsC4Media
 
Kafka Needs No Keeper
Kafka Needs No KeeperKafka Needs No Keeper
Kafka Needs No KeeperC4Media
 
High Performing Teams Act Like Owners
High Performing Teams Act Like OwnersHigh Performing Teams Act Like Owners
High Performing Teams Act Like OwnersC4Media
 
Does Java Need Inline Types? What Project Valhalla Can Bring to Java
Does Java Need Inline Types? What Project Valhalla Can Bring to JavaDoes Java Need Inline Types? What Project Valhalla Can Bring to Java
Does Java Need Inline Types? What Project Valhalla Can Bring to JavaC4Media
 
Service Meshes- The Ultimate Guide
Service Meshes- The Ultimate GuideService Meshes- The Ultimate Guide
Service Meshes- The Ultimate GuideC4Media
 
Shifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CDShifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CDC4Media
 
CI/CD for Machine Learning
CI/CD for Machine LearningCI/CD for Machine Learning
CI/CD for Machine LearningC4Media
 
Fault Tolerance at Speed
Fault Tolerance at SpeedFault Tolerance at Speed
Fault Tolerance at SpeedC4Media
 
Architectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep SystemsArchitectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep SystemsC4Media
 
ML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.jsML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.jsC4Media
 
Build Your Own WebAssembly Compiler
Build Your Own WebAssembly CompilerBuild Your Own WebAssembly Compiler
Build Your Own WebAssembly CompilerC4Media
 
User & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix ScaleUser & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix ScaleC4Media
 
Scaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's EdgeScaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's EdgeC4Media
 
Make Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home EverywhereMake Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home EverywhereC4Media
 
The Talk You've Been Await-ing For
The Talk You've Been Await-ing ForThe Talk You've Been Await-ing For
The Talk You've Been Await-ing ForC4Media
 
Future of Data Engineering
Future of Data EngineeringFuture of Data Engineering
Future of Data EngineeringC4Media
 
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and MoreAutomated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and MoreC4Media
 

Mehr von C4Media (20)

Streaming a Million Likes/Second: Real-Time Interactions on Live Video
Streaming a Million Likes/Second: Real-Time Interactions on Live VideoStreaming a Million Likes/Second: Real-Time Interactions on Live Video
Streaming a Million Likes/Second: Real-Time Interactions on Live Video
 
Next Generation Client APIs in Envoy Mobile
Next Generation Client APIs in Envoy MobileNext Generation Client APIs in Envoy Mobile
Next Generation Client APIs in Envoy Mobile
 
Software Teams and Teamwork Trends Report Q1 2020
Software Teams and Teamwork Trends Report Q1 2020Software Teams and Teamwork Trends Report Q1 2020
A Guide to Python Frameworks for Hadoop

  • 1. A Guide to Python Frameworks for Hadoop. Uri Laserson | Data Scientist | laserson@cloudera.com | 14 June 2013
  • 2. InfoQ.com: News & Community Site
      • 750,000 unique visitors/month
      • Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese)
      • Post content from our QCon conferences
      • News 15-20 / week
      • Articles 3-4 / week
      • Presentations (videos) 12-15 / week
      • Interviews 2-3 / week
      • Books 1 / month
      Watch the video with slide synchronization on InfoQ.com: http://www.infoq.com/presentations/python-hadoop
  • 3. Presented at QCon New York (www.qconnewyork.com)
      • Purpose of QCon: to empower software development by facilitating the spread of knowledge and innovation
      • Strategy
          • practitioner-driven conference designed for YOU: influencers of change and innovation in your teams
          • speakers and topics driving the evolution and innovation
          • connecting and catalyzing the influencers and innovators
      • Highlights
          • attended by more than 12,000 delegates since 2007
          • held in 9 cities worldwide
  • 4. About the speaker
      • Joined Cloudera late 2012
          • Focused on life sciences/medical
      • PhD in BME/computational biology at MIT/Harvard (2005-2012)
          • Focused on genomics
      • Cofounded Good Start Genetics (2007-)
          • Applying next-gen DNA sequencing to genetic carrier screening
  • 5. About the speaker
      • No formal training in computer science
      • Never touched Java
      • Almost all work using Python
  • 7. Python frameworks for Hadoop
      • Hadoop Streaming
      • mrjob (Yelp)
      • dumbo
      • Luigi (Spotify)
      • hadoopy
      • pydoop
      • PySpark
      • happy
      • Disco
      • octopy
      • Mortar Data
      • Pig UDF/Jython
      • hipy
  • 8. Goals for Python framework
      1. "Pseudocodiness"/simplicity
      2. Flexibility/generality
      3. Ease of use/installation
      4. Performance
  • 9. An n-gram is a tuple of n words. Problem: aggregating the Google n-gram data (http://books.google.com/ngrams)
  • 10. An n-gram is a tuple of n words. Problem: aggregating the Google n-gram data (http://books.google.com/ngrams). (Figure: eight words numbered 1 to 8, enclosed in parentheses and labeled "8-gram".)
  • 11. "A partial differential equation is an equation that contains partial derivatives."
  • 12. 1-grams of "A partial differential equation is an equation that contains partial derivatives.":
      A             1
      partial       2
      differential  1
      equation      2
      is            1
      an            1
      that          1
      contains      1
      derivatives.  1
  • 13. 2-grams of the same sentence:
      A partial              1
      partial differential   1
      differential equation  1
      equation is            1
      is an                  1
      an equation            1
      equation that          1
      that contains          1
      contains partial       1
      partial derivatives.   1
  • 14. 5-grams of the same sentence:
      A partial differential equation is          1
      partial differential equation is an         1
      differential equation is an equation        1
      equation is an equation that                1
      is an equation that contains                1
      an equation that contains partial           1
      equation that contains partial derivatives. 1
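The 1-gram, 2-gram, and 5-gram listings on the slides above can be reproduced with a few lines of Python. This is a minimal sketch, assuming words are split on whitespace and punctuation stays attached to its word (as in "derivatives." on the slides); the helper name `ngrams` is introduced here for illustration and does not come from the talk.

```python
from collections import Counter


def ngrams(text, n):
    """Return every run of n consecutive whitespace-separated words as a tuple."""
    words = text.split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


sentence = ("A partial differential equation is an equation "
            "that contains partial derivatives.")

# Count how often each n-gram occurs in the example sentence.
unigram_counts = Counter(ngrams(sentence, 1))
bigram_counts = Counter(ngrams(sentence, 2))
five_grams = ngrams(sentence, 5)
```

With the example sentence, `unigram_counts[("partial",)]` is 2, and the sentence yields ten distinct 2-grams and seven 5-grams, matching the slide listings.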
  • 16. Sample of the raw 2-gram data:
      2-gram         year  matches  pages  volumes
      flourished in  1993        2      2        2
      flourished in  1998        2      2        1
      flourished in  1999        6      6        4
      flourished in  2000        5      5        5
      flourished in  2001        1      1        1
      flourished in  2002        7      7        3
      flourished in  2003        9      9        4
      flourished in  2004       22     21       13
      flourished in  2005       37     37       22
      flourished in  2006       55     55       38
      flourished in  2007       99     98       76
      flourished in  2008      220    215      118
      fluid of       1899        2      2        1
      fluid of       2000        3      3        1
      fluid of       2002        2      1        1
      fluid of       2003        3      3        1
      fluid of       2004        3      3        3
  • 17. Compute how often two words are near each other in a given year. Two words are "near" if they are both present in a 2-, 3-, 4-, or 5-gram.
  • 18. Raw data and aggregated results
      Raw data:
        ...2-grams...
        (cat, the)                   1999    14
        (the, cat)                   1999  7002
        ...3-grams...
        (the, cheshire, cat)         1999   563
        ...4-grams...
        ...5-grams...
        (the, cat, in, the, hat)     1999  1023
        (the, dog, chased, the, cat) 1999   403
        (cat, is, one, of, the)      1999    24
      Aggregated results (each word pair in lexicographic ordering):
        (cat, the)  1999  8006
        (hat, the)  1999  1023
      Internal n-grams are counted by the smaller n-grams, which:
      • avoids double-counting
      • increases sensitivity (observed at least 40 times)
  • 19. Pseudocode for MapReduce

      def map(record):
          (ngram, year, count) = unpack(record)
          # ensure word1 is the lexicographically first word:
          (word1, word2) = sorted([ngram[first], ngram[last]])
          key = (word1, word2, year)
          emit(key, count)

      def reduce(key, values):
          emit(key, sum(values))

      All source code available on GitHub: https://github.com/cloudera/python-ngrams
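The pseudocode can be exercised locally without a cluster. The following is a minimal in-memory sketch, assuming each record is a tab-separated line of n-gram, year, and count (the layout the Java mapper in this talk parses); the function names `map_record` and `reduce_counts` are illustrative and are not taken from the python-ngrams repository. The sample records are the raw data shown on the aggregation slide.

```python
from collections import defaultdict


def map_record(record):
    """Turn one tab-separated record into ((word1, word2, year), count)."""
    ngram, year, count = record.split("\t")[:3]
    words = ngram.split()
    # Only the first and last word matter; order them lexicographically.
    word1, word2 = sorted([words[0], words[-1]])
    return (word1, word2, year), int(count)


def reduce_counts(pairs):
    """Sum the counts of all pairs that share a key (shuffle plus reduce)."""
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return dict(totals)


records = [
    "cat the\t1999\t14",
    "the cat\t1999\t7002",
    "the cheshire cat\t1999\t563",
    "the cat in the hat\t1999\t1023",
    "the dog chased the cat\t1999\t403",
    "cat is one of the\t1999\t24",
]
totals = reduce_counts(map_record(r) for r in records)
```

Here `totals[("cat", "the", "1999")]` is 8006 and `totals[("hat", "the", "1999")]` is 1023, the same aggregated results the slide reports.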
  • 20. Native Java

      // NgramsDriver.java
      import org.apache.hadoop.conf.Configured;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.io.IntWritable;
      import org.apache.hadoop.mapreduce.Job;
      import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
      import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
      import org.apache.hadoop.util.Tool;
      import org.apache.hadoop.util.ToolRunner;

      public class NgramsDriver extends Configured implements Tool {

          public int run(String[] args) throws Exception {
              Job job = new Job(getConf());
              job.setJarByClass(getClass());

              FileInputFormat.addInputPath(job, new Path(args[0]));
              FileOutputFormat.setOutputPath(job, new Path(args[1]));

              job.setMapperClass(NgramsMapper.class);
              job.setCombinerClass(NgramsReducer.class);
              job.setReducerClass(NgramsReducer.class);

              job.setOutputKeyClass(TextTriple.class);
              job.setOutputValueClass(IntWritable.class);

              job.setNumReduceTasks(10);

              return job.waitForCompletion(true) ? 0 : 1;
          }

          public static void main(String[] args) throws Exception {
              int exitCode = ToolRunner.run(new NgramsDriver(), args);
              System.exit(exitCode);
          }
      }

      // NgramsMapper.java
      import java.io.IOException;
      import java.util.ArrayList;
      import java.util.Collections;
      import java.util.List;
      import java.util.regex.Matcher;
      import java.util.regex.Pattern;

      import org.apache.hadoop.io.IntWritable;
      import org.apache.hadoop.io.LongWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Mapper;
      import org.apache.hadoop.mapreduce.lib.input.FileSplit;
      import org.apache.log4j.Logger;

      public class NgramsMapper extends Mapper<LongWritable, Text, TextTriple, IntWritable> {

          private Logger LOG = Logger.getLogger(getClass());

          private int expectedTokens;

          @Override
          protected void setup(Context context) throws IOException, InterruptedException {
              // the input file name encodes n (e.g. "...2gram..."), which fixes the expected token count
              String inputFile = ((FileSplit) context.getInputSplit()).getPath().getName();
              LOG.info("inputFile: " + inputFile);
              Pattern c = Pattern.compile("([\\d]+)gram");
              Matcher m = c.matcher(inputFile);
              m.find();
              expectedTokens = Integer.parseInt(m.group(1));
              return;
          }

          @Override
          public void map(LongWritable key, Text value, Context context)
                  throws IOException, InterruptedException {
              String[] data = value.toString().split("\t");

              if (data.length < 3) {
                  return;
              }

              String[] ngram = data[0].split("\\s+");
              String year = data[1];
              IntWritable count = new IntWritable(Integer.parseInt(data[2]));

              if (ngram.length != this.expectedTokens) {
                  return;
              }

              // build keyOut
              List<String> triple = new ArrayList<String>(3);
              triple.add(ngram[0]);
              triple.add(ngram[expectedTokens - 1]);
              Collections.sort(triple);
              triple.add(year);
              TextTriple keyOut = new TextTriple(triple);

              context.write(keyOut, count);
          }
      }

      // NgramsReducer.java
      import java.io.IOException;

      import org.apache.hadoop.io.IntWritable;
      import org.apache.hadoop.mapreduce.Reducer;

      public class NgramsReducer extends Reducer<TextTriple, IntWritable, TextTriple, IntWritable> {

          @Override
          protected void reduce(TextTriple key, Iterable<IntWritable> values, Context context)
                  throws IOException, InterruptedException {
              int sum = 0;
              for (IntWritable value : values) {
                  sum += value.get();
              }
              context.write(key, new IntWritable(sum));
          }
      }

      // TextTriple.java
      import java.io.DataInput;
      import java.io.DataOutput;
      import java.io.IOException;
      import java.util.List;

      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.io.WritableComparable;

      public class TextTriple implements WritableComparable<TextTriple> {

          private Text first;
          private Text second;
          private Text third;

          public TextTriple() {
              set(new Text(), new Text(), new Text());
          }

          public TextTriple(List<String> list) {
              set(new Text(list.get(0)),
                  new Text(list.get(1)),
                  new Text(list.get(2)));
          }

          public void set(Text first, Text second, Text third) {
              this.first = first;
              this.second = second;
              this.third = third;
          }

          public void write(DataOutput out) throws IOException {
              first.write(out);
              second.write(out);
              third.write(out);
          }

          public void readFields(DataInput in) throws IOException {
              first.readFields(in);
              second.readFields(in);
              third.readFields(in);
          }

          @Override
          public int hashCode() {
              return first.hashCode() * 163 + second.hashCode() * 31 + third.hashCode();
          }

          @Override
          public boolean equals(Object obj) {
              if (obj instanceof TextTriple) {
                  TextTriple tt = (TextTriple) obj;
                  return first.equals(tt.first) && second.equals(tt.second) && third.equals(tt.third);
              }
              return false;
          }

          @Override
          public String toString() {
              return first + "\t" + second + "\t" + third;
          }

          public int compareTo(TextTriple other) {
              int comp = first.compareTo(other.first);
              if (comp != 0) {
                  return comp;
              }
              comp = second.compareTo(other.second);
              if (comp != 0) {
                  return comp;
              }
              return third.compareTo(other.third);
          }
      }
  • 21. Native Java
      • Maximum flexibility
      • Fastest performance
      • Native to Hadoop
      • Most difficult to write
  • 22. Python implementation strategies
      • Hadoop Streaming
          • mrjob
          • dumbo
          • hadoopy
      • Hadoop Pipes
          • pydoop
      • Non-Hadoop
          • Disco
          • octopy
  • 23. Hadoop Streaming: execution

      hadoop jar hadoop-streaming-2.0.0-mr1-cdh4.1.2.jar \
          -input /ngrams \
          -output /output-streaming \
          -mapper mapper.py \
          -combiner reducer.py \
          -reducer reducer.py \
          -jobconf stream.num.map.output.key.fields=3 \
          -jobconf stream.num.reduce.output.key.fields=3 \
          -jobconf mapred.reduce.tasks=10 \
          -file mapper.py \
          -file reducer.py
  • 25. Hadoop Streaming: features
      • Canonical method for using any executable as mapper/reducer
          • Includes shell commands, like grep
      • Transparent communication with Hadoop through stdin/stdout
      • Key boundaries manually detected in reducer
      • Built-in with Hadoop: should require no additional framework installation
      • Developer must decide how to encode more complicated objects (e.g., JSON) or binary data
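To make "key boundaries manually detected in reducer" concrete, here is a minimal sketch of a Streaming-style mapper and reducer written as plain generator functions over lines of text. It assumes tab-separated records with a three-field key (word1, word2, year) followed by a count, which is consistent with the `stream.num.map.output.key.fields=3` setting on the execution slide. The function names are illustrative; this is not the `mapper.py` and `reducer.py` from the talk's repository.

```python
from itertools import groupby


def mapper(lines):
    """Yield 'word1<TAB>word2<TAB>year<TAB>count' for each valid input line."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip malformed records
        words = fields[0].split()
        if not words:
            continue
        word1, word2 = sorted([words[0], words[-1]])
        yield "\t".join([word1, word2, fields[1], fields[2]])


def reducer(lines):
    """Sum counts per key; relies on the input arriving sorted by key."""
    def parse(line):
        word1, word2, year, count = line.rstrip("\n").split("\t")
        return (word1, word2, year), int(count)

    parsed = (parse(line) for line in lines)
    # groupby starts a new group whenever the key changes: this is the
    # manual key-boundary detection that Streaming leaves to the developer.
    for key, group in groupby(parsed, key=lambda pair: pair[0]):
        total = sum(count for _, count in group)
        yield "\t".join(key + (str(total),))
```

In an actual Streaming job each function would live in its own script, read from `sys.stdin`, and print each yielded line to stdout; Hadoop sorts the mapper output by key before it reaches the reducer.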
  • 26. mrjob

      from mrjob.job import MRJob
      from mrjob.protocol import RawProtocol

      class NgramNeighbors(MRJob):
          # specify input/intermed/output serialization
          # default output protocol is JSON; here we set it to text
          OUTPUT_PROTOCOL = RawProtocol

          def mapper(self, key, line):
              pass

          def combiner(self, key, counts):
              pass

          def reducer(self, key, counts):
              pass

      if __name__ == '__main__':
          # sets up a runner, based on command line options
          NgramNeighbors.run()
  • 27. mrjob: runner

      ./ngrams.py -r hadoop \
          --hadoop-bin /usr/bin/hadoop \
          --jobconf mapred.reduce.tasks=10 \
          -o hdfs:///output-mrjob \
          hdfs:///ngrams
  • 29. mrjob: features
      • Abstracted MapReduce interface
      • Handles complex Python objects
      • Multi-step MapReduce workflows
      • Extremely tight AWS integration
      • Easily choose to run locally, on Hadoop cluster, or on EMR
      • Actively developed; great documentation
  • 30. mrjob: serialization

      Defaults:

      class MyMRJob(mrjob.job.MRJob):
          INPUT_PROTOCOL = mrjob.protocol.RawValueProtocol
          INTERNAL_PROTOCOL = mrjob.protocol.JSONProtocol
          OUTPUT_PROTOCOL = mrjob.protocol.JSONProtocol

      Available:
      • RawProtocol / RawValueProtocol
      • JSONProtocol / JSONValueProtocol
      • PickleProtocol / PickleValueProtocol
      • ReprProtocol / ReprValueProtocol

      Custom protocols can be written. No current support for binary serialization schemes.
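As an illustration of "custom protocols can be written", the sketch below shows one possible shape for a protocol: an object with a `read` method that decodes a line into a key and value, and a `write` method that encodes them back. The class name `TabJSONProtocol` is hypothetical, the code deliberately does not import mrjob, and whether a given mrjob version expects `str` or `bytes` lines is an assumption to check against its documentation.

```python
import json


class TabJSONProtocol:
    """Hypothetical protocol: JSON-encoded key and value separated by a tab."""

    def read(self, line):
        # JSON escapes tab characters inside strings, so the first literal
        # tab in the line is always the key/value separator.
        raw_key, raw_value = line.split("\t", 1)
        return json.loads(raw_key), json.loads(raw_value)

    def write(self, key, value):
        return "\t".join([json.dumps(key), json.dumps(value)])


protocol = TabJSONProtocol()
line = protocol.write(["cat", "the", 1999], 8006)
key, value = protocol.read(line)
```

Note that JSON has no tuple type, so a tuple key written this way comes back as a list; that is one reason the choice of serialization matters for composite keys.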
  • 31. dumbo
      • Similar in spirit to mrjob
          • abstracted
          • complex objects
          • various runners
          • composable jobs
      • Sporadically developed?
      • Documentation is a series of blog posts
  • 32. dumbo: serialization
      • Typed bytes added to Hadoop allowing binary data
      • ctypedbytes
          • binary serialization
          • packs Python objects in C structs
          • Much faster and more efficient than JSON or pickle
      • Natively read SequenceFile
      • Execute code from any Python egg or JAR
      • Point to any Java InputFormat
  • 33. dumbo: installation notes
      • Required manual install on each node
      • dumbo and typedbytes had to be installed as Python eggs
      • Had trouble running a combiner due to MemoryErrors
  • 34. hadoopy
      • Similar to dumbo, with better docs
      • Typedbytes serialization
      • Experimental HBase integration
      • Allows launching Python jobs even on nodes that do not have Python
      • No command line utility: must launch MR jobs within a Python program
  • 35. pydoop
      • Wraps Hadoop Pipes (C++ API) instead of Streaming
      • HDFS commands communicate through libhdfs rather than shell
      • Ability to implement a Python Partitioner, RecordReader, and RecordWriter
      • All input/output must be strings
      • Could not install it
  • 36. luigi
      • Full-fledged workflow management, task scheduling, dependency resolution tool in Python (similar to Apache Oozie)
      • Built-in support for Hadoop by wrapping Streaming
      • Not as fully-featured as mrjob for Hadoop, but easily customizable
      • Internal serialization through repr/eval
      • Actively developed at Spotify
      • README is good but documentation is lacking
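The "internal serialization through repr/eval" point can be illustrated with a small round trip. This is a sketch of the general idea only, not luigi's actual code, and it swaps in `ast.literal_eval` for plain `eval` because it only accepts Python literals; that substitution is a choice made here, not something the talk states.

```python
import ast

key = ("cat", "the", 1999)
count = 8006

# Encode: repr() escapes tabs inside strings, so a literal tab is a safe separator.
line = repr(key) + "\t" + repr(count)

# Decode: evaluate each field back into a Python object.
raw_key, raw_value = line.split("\t")
decoded = (ast.literal_eval(raw_key), ast.literal_eval(raw_value))
```

Unlike a JSON round trip, the tuple key survives as a tuple, at the cost of a format that only Python can read.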
  • 37. luigi: runner

      python ngrams.py Ngrams \
          --local-scheduler \
          --n-reduce-tasks 10 \
          --source /ngrams \
          --destination /output-luigi
  • 46. Python frameworks for Hadoop
      • Hadoop Streaming ✓
      • mrjob (Yelp) ✓
      • dumbo ✓
      • Luigi (Spotify) ✓
      • hadoopy ✓
      • pydoop ❌
      • PySpark: not Hadoop
      • happy: abandoned? Jython-based
      • Disco: not Hadoop
      • octopy: not serious/not Hadoop
      • Mortar Data: HaaS; support numpy, scipy, nltk, pip-installable in UDF
      • Pig UDF/Jython: Pig is another talk; Jython limited
      • hipy: Python syntactic sugar to construct Hive queries
  • 49. The cluster
      • 5 virtual machines
          • 4 CPUs
          • 10 GB RAM
          • 100 GB disk
          • CentOS 6.2
      • CDH4 (Hadoop 2)
      • 20 map tasks
      • 10 reduce tasks
      • Python 2.6
  • 51. (Unscientific) performance comparison: Streaming has lowest overhead
  • 53. (Unscientific) performance comparison: Combiner was not used
  • 56. Conclusions
      • Prefer Hadoop Streaming if possible
          • It's easy enough
          • Lowest overhead
      • Prefer mrjob for higher abstraction
          • Actively developed/great documentation
          • Feature-rich (incl. composable jobs)
          • Integration with AWS
      • Prefer luigi for more complicated job flows
          • Actively developed
          • Much more general than purely Hadoop