Big Data & Hadoop
Simone Leo
Distributed Computing – CRS4
http://www.crs4.it
Seminar Series 2015
Outline
1 Big Data
2 MapReduce and Hadoop
    The MapReduce Model
    Hadoop
3 Pydoop
    Motivation
    Architecture
    Usage
The Data Deluge
[Images: satellites, social networks, DNA sequencing, high-energy physics]
Storage costs rapidly decreasing
Easier to acquire huge amounts of data:
Social networks
Particle colliders
Radio telescopes
DNA sequencers
Wearable sensors
. . .
The Three Vs: Volume
http://www.ibmbigdatahub.com/infographic/four-vs-big-data
The Three Vs: Velocity
http://www.ibmbigdatahub.com/infographic/four-vs-big-data
The Three Vs: Variety
http://www.ibmbigdatahub.com/infographic/four-vs-big-data
Big Data: a Hard-to-Define Buzzword
OED: “data of a very large size, typically to the extent that its manipulation and management present significant logistical challenges” [1]
Wikipedia: “data sets so large or complex that traditional data processing applications are inadequate”
datascience@berkeley blog: 43 definitions [2]
How big is big?
Depends on the field and organization
Varies with time
[Volume] does not fit in memory / in a hard disk
[Velocity] production faster than processing time
[Variety] unknown / unsupported formats . . .
[1] http://www.forbes.com/sites/gilpress/2013/06/18/big-data-news-a-revolution-indeed
[2] http://datascience.berkeley.edu/what-is-big-data
The Free Lunch is Over
CPU clock speed reached saturation around 2004
Multi-core architectures
Everyone must go parallel
Moore’s law reinterpreted:
the number of cores per chip doubles every two years
clock speed remains fixed or decreases
http://www.gotw.ca/publications/concurrency-ddj.htm
Data-Intensive Applications
A new class of problems
Needs specific solutions:
Infrastructure
Technologies
Professional background
Scalability: automatically adapt to changes in size
Distributed computing
What is MapReduce?
A programming model for large-scale distributed data processing
Inspired by map and reduce in functional programming
Map: maps a set of input key/value pairs to a set of intermediate key/value pairs
Reduce: applies a function to all values associated with the same intermediate key; emits output key/value pairs
An implementation of a system to execute such programs:
Fault-tolerant
Hides internals from users
Scales very well with dataset size
MapReduce’s Hello World: Word Count
Input: the quick brown fox ate the lazy green fox
Map: each Mapper emits (word, 1) for every word in its chunk of the input:
(the, 1) (quick, 1) (brown, 1) (fox, 1) (ate, 1) (the, 1) (lazy, 1) (green, 1) (fox, 1)
Shuffle & Sort: intermediate pairs are grouped by key and routed to the Reducers
Reduce: each Reducer sums the values associated with its keys:
(ate, 1) (brown, 1) (fox, 2) (green, 1) (lazy, 1) (quick, 1) (the, 2)
Wordcount: Pseudocode
map(String key, String value):
    // key: does not matter in this case
    // value: a subset of input words
    for each word w in value:
        Emit(w, "1");

reduce(String key, Iterator values):
    // key: a word
    // values: all values associated to key
    int wordcount = 0;
    for each v in values:
        wordcount += ParseInt(v);
    Emit(key, AsString(wordcount));
Mock Implementation – mockmr.py
from itertools import groupby
from operator import itemgetter

def _pick_last(it):
    for t in it:
        yield t[-1]

def mapreduce(data, mapf, redf):
    buf = []
    for line in data.splitlines():
        for ik, iv in mapf("foo", line):
            buf.append((ik, iv))
    buf.sort()
    for ik, values in groupby(buf, itemgetter(0)):
        for ok, ov in redf(ik, _pick_last(values)):
            print ok, ov
Mock Implementation – mockwc.py
from mockmr import mapreduce

DATA = """the quick brown
fox ate the
lazy green fox
"""

def map_(k, v):
    for w in v.split():
        yield w, 1

def reduce_(k, values):
    yield k, sum(v for v in values)

if __name__ == "__main__":
    mapreduce(DATA, map_, reduce_)
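
Running python mockwc.py prints each distinct word with its count, sorted by key:

ate 1
brown 1
fox 2
green 1
lazy 1
quick 1
the 2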
MapReduce: Execution Model
[Diagram: the input on the DFS is divided into splits (split 0 to split 4); intermediate files live on the mappers’ local disks; the output goes back to the DFS]
(1) The user program forks the master and the mapper/reducer workers
(2) The master assigns map tasks and reduce tasks
(3) Mappers read their input splits from the DFS
(4) Mappers write intermediate files to local disks
(5) Reducers remotely read the intermediate files
(6) Reducers write the output to the DFS
adapted from Zhao et al., MapReduce: The programming model and practice, 2009 – http://research.google.com/pubs/pub36249.html
MapReduce vs Alternatives – 1
[Chart: MPI, MapReduce, and DBMS/SQL positioned along two axes: programming model (procedural to declarative) and data organization (flat raw files to structured)]
adapted from Zhao et al., MapReduce: The programming model and practice, 2009 – http://research.google.com/pubs/pub36249.html
MapReduce vs Alternatives – 2
                   MPI                   MapReduce                 DBMS/SQL
programming model  message passing       map/reduce                declarative
data organization  no assumption         files split into blocks   organized structures
data type          any                   (k,v) string/Avro         tables with rich types
execution model    independent nodes     map/shuffle/reduce        transaction
communication      high                  low                       high
granularity        fine                  coarse                    fine
usability          steep learning curve  simple concept            runtime: hard to debug
key selling point  run any application   huge datasets             interactive querying
There is no one-size-fits-all solution
Choose according to your problem’s characteristics
adapted from Zhao et al., MapReduce: The programming model and practice, 2009 – http://research.google.com/pubs/pub36249.html
Hadoop: Overview
Disclaimer: we first describe Hadoop 1, then highlight what’s different in Hadoop 2
Scalable:
Thousands of nodes
Petabytes of data over 10M files
Single file: gigabytes to terabytes
Economical:
Open source
COTS hardware (but master nodes should be reliable)
Well-suited to bag-of-tasks applications (many bio apps)
Files are split into blocks and distributed across nodes
High-throughput access to huge datasets
WORM (write once, read many) storage model
Hadoop: Architecture
[Diagram: a Client sends a Job Request to the Job Tracker (MR master); Task Trackers (MR slaves) run alongside Datanodes (HDFS slaves), with the Namenode as HDFS master]
The client sends a job request to the JT
The JT queries the NN about data block locations
The input stream is split among n map tasks
Map tasks are scheduled closest to where the data reside
Hadoop Distributed File System (HDFS)
[Diagram: a client asks the Namenode for a file’s block IDs and datanode locations, then reads block data directly from the Datanodes; the blocks of file “a” (a1, a2, a3) are replicated across Datanodes on three racks. The Secondary Namenode does namespace checkpointing and helps the Namenode with its logs; it is not a Namenode replacement]
Large blocks (∼100 MB)
Each block is replicated n times (3 by default)
One replica on the same rack, the others on different racks
You have to provide the network topology
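
Blocks and replication can be inspected from the command line with the standard Hadoop tools; a minimal sketch, with an illustrative path:

hadoop fs -put bigfile.dat /user/simleo/bigfile.dat
hadoop fsck /user/simleo/bigfile.dat -files -blocks -locations

fsck lists each block of the file together with the datanodes holding its replicas.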
Wordcount: (part of) Java Code
public static class TokenizerMapper
        extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
        }
    }
}

public static class IntSumReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) sum += val.get();
        result.set(sum);
        context.write(key, result);
    }
}
Other MapReduce Components
Combiner (local reducer)
InputFormat / RecordReader
Translates the byte-oriented view of input files into the record-oriented view required by the Mapper
Directly accesses HDFS files
Processing unit: input split
Partitioner
Decides which Reducer receives which key
Typically uses a hash function of the key
OutputFormat / RecordWriter
Writes key/value pairs output by the Reducer
Directly accesses HDFS files
Input Formats and HDFS
[Diagram: the InputFormat mediates between HDFS and the job]
Input split vs. HDFS block:
An HDFS block is an actual data chunk in an HDFS file
An input split is a logical representation of a subset of the MapReduce input dataset
Hadoop 2
HDFS Federation:
Multiple namenodes can share the burden of maintaining the namespace(s) and managing blocks
Better scalability
HDFS high availability (HA):
Redundant namenodes can run in active/passive mode
Allows failover in case of crash or planned maintenance
YARN:
Resource Manager: global assignment of compute resources to applications (HA supported since 2.4.0)
Application Master(s): per-application scheduling and coordination (like a per-job JT, on a slave)
MapReduce is just one of many possible frameworks
Developing your First MapReduce Application
The easiest path for beginners is Hadoop Streaming:
Java package included in Hadoop
Use any executable as the mapper or reducer
Read key/value pairs from standard input
Write them to standard output
Text protocol: records are serialized as <key>\t<value>\n
Word count mapper (Python)
#!/usr/bin/env python
import sys

for line in sys.stdin:
    for word in line.split():
        print "%s\t1" % word
Hadoop Streaming – Word Count Reducer (Python)
#!/usr/bin/env python
import sys

def serialize(key, value):
    return "%s\t%d" % (key, value)

def deserialize(line):
    key, value = line.split("\t", 1)
    return key, int(value)

def main():
    prev_key, out_value = None, 0
    for line in sys.stdin:
        key, value = deserialize(line)
        if key != prev_key:
            if prev_key is not None:
                print serialize(prev_key, out_value)
            out_value = 0
            prev_key = key
        out_value += value
    if prev_key is not None:  # flush the last key; guards against empty input
        print serialize(prev_key, out_value)

if __name__ == "__main__":
    main()
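
Hadoop sorts the map output by key before it reaches the reducer, which is what the prev_key bookkeeping above relies on. The pair can then be launched with the streaming jar; a sketch, assuming a Hadoop 1.x layout and that the two scripts are saved as mapper.py and reducer.py (adjust the jar and HDFS paths to your installation):

hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
    -input wc_input -output wc_output \
    -mapper mapper.py -reducer reducer.py \
    -file mapper.py -file reducer.py

The -file options ship the scripts to the compute nodes.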
MapReduce Development with Hadoop
Java: native
C++: MapReduce API (via Hadoop Pipes) + libhdfs (included in the Hadoop distribution)
Python: several solutions, but do they meet all of the requirements of nontrivial apps?
Reuse existing modules, including extensions
NumPy / SciPy for numerical computation
Specialized components (RecordReader/Writer, Partitioner)
HDFS access
. . .
Python MR: Hadoop-Integrated Solutions
Hadoop Streaming:
Awkward programming style
Can only write mapper and reducer scripts (no RecordReader, etc.)
No HDFS access
Jython:
Incomplete standard library
Most third-party packages are only compatible with CPython
Cannot use C/C++ extension modules
Typically one or more releases behind CPython
Python MR: Third Party Solutions
Most third-party packages are Hadoop Streaming wrappers
They add usability, but share many of Streaming’s limitations
Most are not actively developed anymore
Our solution: Pydoop (https://github.com/crs4/pydoop)
Actively developed since 2009
(Almost) pure Python Pipes client + libhdfs wrapper
Pydoop: Overview
Access to most MR components, including RecordReader, RecordWriter, and Partitioner
Get configuration, set counters, and report status
Event-driven programming model (like the Java API, but with the simplicity of Python):
You define classes; the MapReduce framework instantiates them and calls their methods
CPython → use any module
HDFS API
Summary of Features
                   Streaming  Jython   Pydoop
C/C++ Ext          Yes        No       Yes
Standard Lib       Full       Partial  Full
MR API             No*        Full     Partial
Event-driven prog  No         Yes      Yes
HDFS access        No         Yes      Yes
(*) you can only write the map and reduce parts as executable scripts.
Hadoop Pipes
[Diagram: the Java Hadoop framework (PipesMapRunner, PipesReducer) communicates with a C++ application over upward and downward protocols; the record reader, mapper, partitioner, and combiner run in one child process, the reducer and record writer in another]
The application runs as a separate process
Communication with the Java framework happens via persistent sockets
The C++ app provides a factory used by the framework to create MR components
Integration of Pydoop with Hadoop
[Diagram: Pydoop’s pure Python modules sit on two extension modules: a core ext that handles (de)serialization for Hadoop’s Java pipes framework, and an hdfs ext that wraps the C libhdfs, itself a JNI wrapper around the Java HDFS API]
Python Wordcount, Full Program Code
import pydoop.mapreduce.api as api
import pydoop.mapreduce.pipes as pp

class Mapper(api.Mapper):

    def map(self, context):
        words = context.value.split()
        for w in words:
            context.emit(w, 1)

class Reducer(api.Reducer):

    def reduce(self, context):
        s = sum(context.values)
        context.emit(context.key, s)

def __main__():
    pp.run_task(pp.Factory(Mapper, Reducer))
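
Assuming the program above is saved as wc.py, it can be launched with Pydoop’s submit tool; a sketch, since the exact flags depend on the Pydoop version:

pydoop submit --upload-file-to-cache wc.py wc wc_input wc_output

The first positional argument is the module name, followed by the input and output HDFS paths.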
Status Reports and Counters
import pydoop.mapreduce.api as api

class Mapper(api.Mapper):

    def __init__(self, context):
        super(Mapper, self).__init__(context)
        context.set_status("initializing mapper")
        self.input_words = context.get_counter("WC", "INPUT_WORDS")

    def map(self, context):
        words = context.value.split()
        for w in words:
            context.emit(w, 1)
        context.increment_counter(self.input_words, len(words))
Status Reports and Counters: Web UI
Optional Components: Record Reader
import pydoop.mapreduce.api as api
import pydoop.hdfs as hdfs
from pydoop.utils.serialize import serialize_to_string as to_s

class Reader(api.RecordReader):

    def __init__(self, context):
        super(Reader, self).__init__(context)
        self.isplit = context.input_split
        self.file = hdfs.open(self.isplit.filename)
        self.file.seek(self.isplit.offset)
        self.bytes_read = 0
        if self.isplit.offset > 0:
            # discard the partial line belonging to the previous split
            discarded = self.file.readline()
            self.bytes_read += len(discarded)

    def next(self):
        if self.bytes_read > self.isplit.length:
            raise StopIteration
        key = to_s(self.isplit.offset + self.bytes_read)
        record = self.file.readline()
        if record == "":  # end of file
            raise StopIteration
        self.bytes_read += len(record)
        return key, record
Optional Components: Record Writer, Partitioner
import sys
import pydoop.hdfs as hdfs

class Writer(api.RecordWriter):

    def __init__(self, context):
        super(Writer, self).__init__(context)
        jc = context.job_conf
        part = jc.get_int("mapred.task.partition")
        out_dir = jc["mapred.work.output.dir"]
        outfn = "%s/part-%05d" % (out_dir, part)
        hdfs_user = jc.get("pydoop.hdfs.user", None)
        self.file = hdfs.open(outfn, "w", user=hdfs_user)
        self.sep = jc.get("mapred.textoutputformat.separator", "\t")

    def close(self):
        self.file.close()
        self.file.fs.close()

    def emit(self, key, value):
        self.file.write("%s%s%s\n" % (key, self.sep, value))

class Partitioner(api.Partitioner):

    def partition(self, key, n_red):
        # map the key's hash to a non-negative reducer index
        reducer_id = (hash(key) & sys.maxint) % n_red
        return reducer_id
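
These optional components are handed to the framework through the factory, next to the mapper and reducer; a sketch, assuming the keyword argument names match this Pydoop version’s Factory:

def __main__():
    pp.run_task(pp.Factory(
        Mapper, Reducer,
        record_reader_class=Reader,
        record_writer_class=Writer,
        partitioner_class=Partitioner,
    ))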
The HDFS Module
>>> import pydoop.hdfs as hdfs
>>> hdfs.mkdir("test")
>>> hdfs.dump("hello", "test/hello.txt")
>>> text = hdfs.load("test/hello.txt")
>>> print text
hello
>>> hdfs.ls("test")
['hdfs://localhost:9000/user/simleo/test/hello.txt']
>>> hdfs.stat("test/hello.txt").st_size
5L
>>> hdfs.path.isdir("test")
True
>>> hdfs.cp("test", "test.copy")
>>> with hdfs.open("test/hello.txt", "r") as fi:
... print fi.read(3)
...
hel
HDFS Usage by Block Size
import pydoop.hdfs as hdfs

ROOT = "pydoop_test_tree"

def usage_by_bs(fs, root):
    stats = {}
    for info in fs.walk(root):
        if info["kind"] == "directory":
            continue
        bs = int(info["block_size"])
        size = int(info["size"])
        stats[bs] = stats.get(bs, 0) + size
    return stats

if __name__ == "__main__":
    with hdfs.hdfs() as fs:
        print "BS (MB)\tBYTES USED"
        for k, v in sorted(usage_by_bs(fs, ROOT).iteritems()):
            print "%d\t%d" % (k / 2**20, v)
Thank you for your time!