Big Data & Hadoop
Simone Leo
Distributed Computing – CRS4
http://www.crs4.it
Seminar Series 2015
Outline
1 Big Data
2 MapReduce and Hadoop
    The MapReduce Model
    Hadoop
3 Pydoop
    Motivation
    Architecture
    Usage
The Data Deluge
[Images: satellites, social networks, DNA sequencing, high-energy physics]
Storage costs rapidly decreasing
Easier to acquire huge amounts of data:
Social networks
Particle colliders
Radio telescopes
DNA sequencers
Wearable sensors
. . .
The Three Vs: Volume
http://www.ibmbigdatahub.com/infographic/four-vs-big-data
The Three Vs: Velocity
http://www.ibmbigdatahub.com/infographic/four-vs-big-data
The Three Vs: Variety
http://www.ibmbigdatahub.com/infographic/four-vs-big-data
Big Data: a Hard-to-Define Buzzword
OED: “data of a very large size, typically to the extent that its manipulation and management present significant logistical challenges” [1]
Wikipedia: “data sets so large or complex that traditional data processing applications are inadequate”
datascience@berkeley blog: 43 definitions [2]
How big is big?
Depends on the field and organization
Varies with time
[Volume] does not fit in memory / in a hard disk
[Velocity] production faster than processing time
[Variety] unknown / unsupported formats . . .
[1] http://www.forbes.com/sites/gilpress/2013/06/18/big-data-news-a-revolution-indeed
[2] http://datascience.berkeley.edu/what-is-big-data
The Free Lunch is Over
CPU clock speed reached saturation around 2004
Multi-core architectures
Everyone must go parallel
Moore’s law reinterpreted:
the number of cores per chip doubles every two years
clock speed remains fixed or decreases
http://www.gotw.ca/publications/concurrency-ddj.htm
Data-Intensive Applications
A new class of problems
Needs specific solutions:
Infrastructure
Technologies
Professional background
Scalability: automatically adapt to changes in size
Distributed computing
What is MapReduce?
A programming model for large-scale distributed data processing
Inspired by map and reduce in functional programming
Map: maps a set of input key/value pairs to a set of intermediate key/value pairs
Reduce: applies a function to all values associated with the same intermediate key; emits output key/value pairs
An implementation of a system to execute such programs:
Fault-tolerant
Hides internals from users
Scales very well with dataset size
MapReduce’s Hello World: Word Count
Input: the quick brown fox ate the lazy green fox
Map: each Mapper emits (word, 1) for every word in its chunk of the input:
(the, 1) (quick, 1) (brown, 1) (fox, 1) (ate, 1) (the, 1) (lazy, 1) (green, 1) (fox, 1)
Shuffle & Sort: intermediate pairs are grouped by key and routed to the Reducers
Reduce: each Reducer sums the values associated with its keys:
(ate, 1) (brown, 1) (fox, 2) (green, 1) (lazy, 1) (quick, 1) (the, 2)
Wordcount: Pseudocode
map(String key, String value):
    // key: does not matter in this case
    // value: a subset of input words
    for each word w in value:
        Emit(w, "1");

reduce(String key, Iterator values):
    // key: a word
    // values: all values associated to key
    int wordcount = 0;
    for each v in values:
        wordcount += ParseInt(v);
    Emit(key, AsString(wordcount));
Mock Implementation – mockmr.py
from itertools import groupby
from operator import itemgetter

def _pick_last(it):
    for t in it:
        yield t[-1]

def mapreduce(data, mapf, redf):
    buf = []
    for line in data.splitlines():
        for ik, iv in mapf("foo", line):
            buf.append((ik, iv))
    buf.sort()
    for ik, values in groupby(buf, itemgetter(0)):
        for ok, ov in redf(ik, _pick_last(values)):
            print ok, ov
Mock Implementation – mockwc.py
from mockmr import mapreduce

DATA = """the quick brown
fox ate the
lazy green fox
"""

def map_(k, v):
    for w in v.split():
        yield w, 1

def reduce_(k, values):
    yield k, sum(v for v in values)

if __name__ == "__main__":
    mapreduce(DATA, map_, reduce_)
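
Running python mockwc.py prints each distinct word with its count, sorted by key:

ate 1
brown 1
fox 2
green 1
lazy 1
quick 1
the 2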
MapReduce: Execution Model
[Diagram: the input on the DFS is divided into splits (split 0 to split 4); intermediate files live on the mappers’ local disks; the output goes back to the DFS]
(1) The user program forks the master and the mapper/reducer workers
(2) The master assigns map tasks and reduce tasks
(3) Mappers read their input splits from the DFS
(4) Mappers write intermediate files to local disks
(5) Reducers remotely read the intermediate files
(6) Reducers write the output to the DFS
adapted from Zhao et al., MapReduce: The programming model and practice, 2009 – http://research.google.com/pubs/pub36249.html
MapReduce vs Alternatives – 1
[Chart: MPI, MapReduce, and DBMS/SQL positioned along two axes: programming model (procedural to declarative) and data organization (flat raw files to structured)]
adapted from Zhao et al., MapReduce: The programming model and practice, 2009 – http://research.google.com/pubs/pub36249.html
MapReduce vs Alternatives – 2
                   MPI                   MapReduce                 DBMS/SQL
programming model  message passing       map/reduce                declarative
data organization  no assumption         files split into blocks   organized structures
data type          any                   (k,v) string/Avro         tables with rich types
execution model    independent nodes     map/shuffle/reduce        transaction
communication      high                  low                       high
granularity        fine                  coarse                    fine
usability          steep learning curve  simple concept            runtime: hard to debug
key selling point  run any application   huge datasets             interactive querying
There is no one-size-fits-all solution
Choose according to your problem’s characteristics
adapted from Zhao et al., MapReduce: The programming model and practice, 2009 – http://research.google.com/pubs/pub36249.html
Hadoop: Overview
Disclaimer: we first describe Hadoop 1, then highlight what’s different in Hadoop 2
Scalable:
Thousands of nodes
Petabytes of data over 10M files
Single file: gigabytes to terabytes
Economical:
Open source
COTS hardware (but master nodes should be reliable)
Well-suited to bag-of-tasks applications (many bio apps)
Files are split into blocks and distributed across nodes
High-throughput access to huge datasets
WORM (write once, read many) storage model
Hadoop: Architecture
[Diagram: a Client sends a Job Request to the Job Tracker (MR master); Task Trackers (MR slaves) run alongside Datanodes (HDFS slaves), with the Namenode as HDFS master]
The client sends a job request to the JT
The JT queries the NN about data block locations
The input stream is split among n map tasks
Map tasks are scheduled closest to where the data reside
Hadoop Distributed File System (HDFS)
[Diagram: a client asks the Namenode for a file’s block IDs and datanode locations, then reads block data directly from the Datanodes; the blocks of file “a” (a1, a2, a3) are replicated across Datanodes on three racks. The Secondary Namenode does namespace checkpointing and helps the Namenode with its logs; it is not a Namenode replacement]
Large blocks (∼100 MB)
Each block is replicated n times (3 by default)
One replica on the same rack, the others on different racks
You have to provide the network topology
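
Blocks and replication can be inspected from the command line with the standard Hadoop tools; a minimal sketch, with an illustrative path:

hadoop fs -put bigfile.dat /user/simleo/bigfile.dat
hadoop fsck /user/simleo/bigfile.dat -files -blocks -locations

fsck lists each block of the file together with the datanodes holding its replicas.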
Wordcount: (part of) Java Code
public static class TokenizerMapper
        extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
        }
    }
}

public static class IntSumReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) sum += val.get();
        result.set(sum);
        context.write(key, result);
    }
}
Other MapReduce Components
Combiner (local reducer)
InputFormat / RecordReader
Translates the byte-oriented view of input files into the record-oriented view required by the Mapper
Directly accesses HDFS files
Processing unit: input split
Partitioner
Decides which Reducer receives which key
Typically uses a hash function of the key
OutputFormat / RecordWriter
Writes key/value pairs output by the Reducer
Directly accesses HDFS files
Input Formats and HDFS
[Diagram: the InputFormat mediates between HDFS and the job]
Input split vs. HDFS block:
An HDFS block is an actual data chunk in an HDFS file
An input split is a logical representation of a subset of the MapReduce input dataset
Hadoop 2
HDFS Federation:
Multiple namenodes can share the burden of maintaining the namespace(s) and managing blocks
Better scalability
HDFS high availability (HA):
Redundant namenodes can run in active/passive mode
Allows failover in case of crash or planned maintenance
YARN:
Resource Manager: global assignment of compute resources to applications (HA supported since 2.4.0)
Application Master(s): per-application scheduling and coordination (like a per-job JT, on a slave)
MapReduce is just one of many possible frameworks
Developing your First MapReduce Application
The easiest path for beginners is Hadoop Streaming:
Java package included in Hadoop
Use any executable as the mapper or reducer
Read key/value pairs from standard input
Write them to standard output
Text protocol: records are serialized as <key>\t<value>\n
Word count mapper (Python)
#!/usr/bin/env python
import sys

for line in sys.stdin:
    for word in line.split():
        print "%s\t1" % word
Hadoop Streaming – Word Count Reducer (Python)
#!/usr/bin/env python
import sys

def serialize(key, value):
    return "%s\t%d" % (key, value)

def deserialize(line):
    key, value = line.split("\t", 1)
    return key, int(value)

def main():
    prev_key, out_value = None, 0
    for line in sys.stdin:
        key, value = deserialize(line)
        if key != prev_key:
            if prev_key is not None:
                print serialize(prev_key, out_value)
            out_value = 0
            prev_key = key
        out_value += value
    if prev_key is not None:  # flush the last key; guards against empty input
        print serialize(prev_key, out_value)

if __name__ == "__main__":
    main()
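
Hadoop sorts the map output by key before it reaches the reducer, which is what the prev_key bookkeeping above relies on. The pair can then be launched with the streaming jar; a sketch, assuming a Hadoop 1.x layout and that the two scripts are saved as mapper.py and reducer.py (adjust the jar and HDFS paths to your installation):

hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
    -input wc_input -output wc_output \
    -mapper mapper.py -reducer reducer.py \
    -file mapper.py -file reducer.py

The -file options ship the scripts to the compute nodes.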
MapReduce Development with Hadoop
Java: native
C++: MapReduce API (via Hadoop Pipes) + libhdfs (included in the Hadoop distribution)
Python: several solutions, but do they meet all of the requirements of nontrivial apps?
Reuse existing modules, including extensions
NumPy / SciPy for numerical computation
Specialized components (RecordReader/Writer, Partitioner)
HDFS access
. . .
Python MR: Hadoop-Integrated Solutions
Hadoop Streaming:
Awkward programming style
Can only write mapper and reducer scripts (no RecordReader, etc.)
No HDFS access
Jython:
Incomplete standard library
Most third-party packages are only compatible with CPython
Cannot use C/C++ extension modules
Typically one or more releases behind CPython
Python MR: Third Party Solutions
Most third-party packages are Hadoop Streaming wrappers
They add usability, but share many of Streaming’s limitations
Most are not actively developed anymore
Our solution: Pydoop (https://github.com/crs4/pydoop)
Actively developed since 2009
(Almost) pure Python Pipes client + libhdfs wrapper
Pydoop: Overview
Access to most MR components, including RecordReader, RecordWriter, and Partitioner
Get configuration, set counters, and report status
Event-driven programming model (like the Java API, but with the simplicity of Python):
You define classes; the MapReduce framework instantiates them and calls their methods
CPython → use any module
HDFS API
Summary of Features
                   Streaming  Jython   Pydoop
C/C++ Ext          Yes        No       Yes
Standard Lib       Full       Partial  Full
MR API             No*        Full     Partial
Event-driven prog  No         Yes      Yes
HDFS access        No         Yes      Yes
(*) you can only write the map and reduce parts as executable scripts.
Hadoop Pipes
[Diagram: the Java Hadoop framework (PipesMapRunner, PipesReducer) communicates with a C++ application over upward and downward protocols; the record reader, mapper, partitioner, and combiner run in one child process, the reducer and record writer in another]
The application runs as a separate process
Communication with the Java framework happens via persistent sockets
The C++ app provides a factory used by the framework to create MR components
Integration of Pydoop with Hadoop
[Diagram: Pydoop’s pure Python modules sit on two extension modules: a core ext that handles (de)serialization for Hadoop’s Java pipes framework, and an hdfs ext that wraps the C libhdfs, itself a JNI wrapper around the Java HDFS API]
Python Wordcount, Full Program Code
import pydoop.mapreduce.api as api
import pydoop.mapreduce.pipes as pp

class Mapper(api.Mapper):

    def map(self, context):
        words = context.value.split()
        for w in words:
            context.emit(w, 1)

class Reducer(api.Reducer):

    def reduce(self, context):
        s = sum(context.values)
        context.emit(context.key, s)

def __main__():
    pp.run_task(pp.Factory(Mapper, Reducer))
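
Assuming the program above is saved as wc.py, it can be launched with Pydoop’s submit tool; a sketch, since the exact flags depend on the Pydoop version:

pydoop submit --upload-file-to-cache wc.py wc wc_input wc_output

The first positional argument is the module name, followed by the input and output HDFS paths.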
Status Reports and Counters
import pydoop.mapreduce.api as api

class Mapper(api.Mapper):

    def __init__(self, context):
        super(Mapper, self).__init__(context)
        context.set_status("initializing mapper")
        self.input_words = context.get_counter("WC", "INPUT_WORDS")

    def map(self, context):
        words = context.value.split()
        for w in words:
            context.emit(w, 1)
        context.increment_counter(self.input_words, len(words))
Status Reports and Counters: Web UI
Optional Components: Record Reader
import pydoop.mapreduce.api as api
import pydoop.hdfs as hdfs
from pydoop.utils.serialize import serialize_to_string as to_s

class Reader(api.RecordReader):

    def __init__(self, context):
        super(Reader, self).__init__(context)
        self.isplit = context.input_split
        self.file = hdfs.open(self.isplit.filename)
        self.file.seek(self.isplit.offset)
        self.bytes_read = 0
        if self.isplit.offset > 0:
            # discard the partial line belonging to the previous split
            discarded = self.file.readline()
            self.bytes_read += len(discarded)

    def next(self):
        if self.bytes_read > self.isplit.length:
            raise StopIteration
        key = to_s(self.isplit.offset + self.bytes_read)
        record = self.file.readline()
        if record == "":  # end of file
            raise StopIteration
        self.bytes_read += len(record)
        return key, record
Optional Components: Record Writer, Partitioner
import sys
import pydoop.hdfs as hdfs

class Writer(api.RecordWriter):

    def __init__(self, context):
        super(Writer, self).__init__(context)
        jc = context.job_conf
        part = jc.get_int("mapred.task.partition")
        out_dir = jc["mapred.work.output.dir"]
        outfn = "%s/part-%05d" % (out_dir, part)
        hdfs_user = jc.get("pydoop.hdfs.user", None)
        self.file = hdfs.open(outfn, "w", user=hdfs_user)
        self.sep = jc.get("mapred.textoutputformat.separator", "\t")

    def close(self):
        self.file.close()
        self.file.fs.close()

    def emit(self, key, value):
        self.file.write("%s%s%s\n" % (key, self.sep, value))

class Partitioner(api.Partitioner):

    def partition(self, key, n_red):
        # map the key's hash to a non-negative reducer index
        reducer_id = (hash(key) & sys.maxint) % n_red
        return reducer_id
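
These optional components are handed to the framework through the factory, next to the mapper and reducer; a sketch, assuming the keyword argument names match this Pydoop version’s Factory:

def __main__():
    pp.run_task(pp.Factory(
        Mapper, Reducer,
        record_reader_class=Reader,
        record_writer_class=Writer,
        partitioner_class=Partitioner,
    ))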
The HDFS Module
>>> import pydoop.hdfs as hdfs
>>> hdfs.mkdir("test")
>>> hdfs.dump("hello", "test/hello.txt")
>>> text = hdfs.load("test/hello.txt")
>>> print text
hello
>>> hdfs.ls("test")
['hdfs://localhost:9000/user/simleo/test/hello.txt']
>>> hdfs.stat("test/hello.txt").st_size
5L
>>> hdfs.path.isdir("test")
True
>>> hdfs.cp("test", "test.copy")
>>> with hdfs.open("test/hello.txt", "r") as fi:
... print fi.read(3)
...
hel
HDFS Usage by Block Size
import pydoop.hdfs as hdfs

ROOT = "pydoop_test_tree"

def usage_by_bs(fs, root):
    stats = {}
    for info in fs.walk(root):
        if info["kind"] == "directory":
            continue
        bs = int(info["block_size"])
        size = int(info["size"])
        stats[bs] = stats.get(bs, 0) + size
    return stats

if __name__ == "__main__":
    with hdfs.hdfs() as fs:
        print "BS (MB)\tBYTES USED"
        for k, v in sorted(usage_by_bs(fs, ROOT).iteritems()):
            print "%d\t%d" % (k / 2**20, v)
Thank you for your time!