Big Data & Hadoop
Simone Leo
Distributed Computing – CRS4
http://www.crs4.it
Seminar Series 2015
Outline
1 Big Data
2 MapReduce and Hadoop
    The MapReduce Model
    Hadoop
3 Pydoop
    Motivation
    Architecture
    Usage
The Data Deluge
[Images: satellites (sciencematters.berkeley.edu), social networks (upload.wikimedia.org), DNA sequencing (www.imagenes-bio.de), high-energy physics (scienceblogs.com)]
Storage costs rapidly decreasing
Easier to acquire huge amounts of data:
Social networks
Particle colliders
Radio telescopes
DNA sequencers
Wearable sensors
. . .
The Three Vs: Volume
http://www.ibmbigdatahub.com/infographic/four-vs-big-data
The Three Vs: Velocity
http://www.ibmbigdatahub.com/infographic/four-vs-big-data
The Three Vs: Variety
http://www.ibmbigdatahub.com/infographic/four-vs-big-data
Big Data: a Hard to Define Buzzword
OED: “data of a very large size, typically to the extent that its manipulation and management present significant logistical challenges”1
Wikipedia: “data sets so large or complex that traditional data processing applications are inadequate”
datascience@berkeley blog: 43 definitions2
How big is big?
Depends on the field and organization
Varies with time
[Volume] does not fit in memory / in a hard disk
[Velocity] production faster than processing time
[Variety] unknown / unsupported formats . . .
1 http://www.forbes.com/sites/gilpress/2013/06/18/big-data-news-a-revolution-indeed
2 http://datascience.berkeley.edu/what-is-big-data
The Free Lunch is Over
CPU clock speed reached saturation around 2004
Multi-core architectures
Everyone must go parallel
Moore’s law reinterpreted:
number of cores per chip doubles every 2 years
clock speed remains fixed or decreases
http://www.gotw.ca/publications/concurrency-ddj.htm
Data-Intensive Applications
A new class of problems
Needs specific solutions:
Infrastructure
Technologies
Professional background
Scalability
Automatically adapt to changes in size
Distributed computing
What is MapReduce?
A programming model for large-scale distributed data processing
Inspired by map and reduce in functional programming
Map: map a set of input key/value pairs to a set of intermediate key/value pairs
Reduce: apply a function to all values associated to the same intermediate key; emit output key/value pairs
An implementation of a system to execute such programs
Fault-tolerant
Hides internals from users
Scales very well with dataset size
MapReduce’s Hello World: Word Count
the quick brown fox ate the lazy green fox
[Diagram: the input is split among mappers, each emitting a (word, 1) pair per word; the shuffle & sort phase groups the pairs by word; reducers sum the counts, producing ate 1, brown 1, fox 2, green 1, lazy 1, quick 1, the 2]
Wordcount: Pseudocode
map(String key, String value):
    // key: does not matter in this case
    // value: a subset of input words
    for each word w in value:
        Emit(w, "1");

reduce(String key, Iterator values):
    // key: a word
    // values: all values associated to key
    int wordcount = 0;
    for each v in values:
        wordcount += ParseInt(v);
    Emit(key, AsString(wordcount));
Mock Implementation – mockmr.py
from itertools import groupby
from operator import itemgetter

def _pick_last(it):
    # strip the keys, yielding only the values
    for t in it:
        yield t[-1]

def mapreduce(data, mapf, redf):
    buf = []
    for line in data.splitlines():
        # the key ("foo") is a dummy: word count ignores it
        for ik, iv in mapf("foo", line):
            buf.append((ik, iv))
    buf.sort()  # stands in for the shuffle & sort phase
    for ik, values in groupby(buf, itemgetter(0)):
        for ok, ov in redf(ik, _pick_last(values)):
            print ok, ov
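Note how buf.sort() followed by groupby plays the role of the framework’s shuffle & sort phase: it brings all intermediate values sharing the same key together before the reduce function is applied.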
Mock Implementation – mockwc.py
from mockmr import mapreduce

DATA = """the quick brown
fox ate the
lazy green fox
"""

def map_(k, v):
    for w in v.split():
        yield w, 1

def reduce_(k, values):
    yield k, sum(v for v in values)

if __name__ == "__main__":
    mapreduce(DATA, map_, reduce_)
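Running the script (python mockwc.py) prints each word with its count, in key order:

ate 1
brown 1
fox 2
green 1
lazy 1
quick 1
the 2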
MapReduce: Execution Model
[Diagram: the user program forks a master, mappers and reducers; the master assigns map tasks and reduce tasks; each mapper reads an input split from the distributed file system (DFS) and writes intermediate files to its local disk; reducers remotely read the intermediate files and write the final output back to the DFS]
adapted from Zhao et al., MapReduce: The programming model and practice, 2009 – http://research.google.com/pubs/pub36249.html
MapReduce vs Alternatives – 1
[Chart: systems positioned by programming model (procedural to declarative) vs data organization (flat raw files to structured): MPI is procedural over flat raw files, DBMS/SQL is declarative over structured data, and MapReduce sits in between]
adapted from Zhao et al., MapReduce: The programming model and practice, 2009 – http://research.google.com/pubs/pub36249.html
MapReduce vs Alternatives – 2
                  | MPI                  | MapReduce               | DBMS/SQL
programming model | message passing      | map/reduce              | declarative
data organization | no assumption        | files split into blocks | organized structures
data type         | any                  | (k,v) string/Avro       | tables with rich types
execution model   | independent nodes    | map/shuffle/reduce      | transaction
communication     | high                 | low                     | high
granularity       | fine                 | coarse                  | fine
usability         | steep learning curve | simple concept          | runtime: hard to debug
key selling point | run any application  | huge datasets           | interactive querying
There is no one-size-fits-all solution
Choose according to your problem’s characteristics
adapted from Zhao et al., MapReduce: The programming model and practice, 2009 – http://research.google.com/pubs/pub36249.html
Hadoop: Overview
Disclaimer
First we describe Hadoop 1
Then we highlight what’s different in Hadoop 2
Scalable
Thousands of nodes
Petabytes of data over 10M files
Single file: Gigabytes to Terabytes
Economical
Open source
COTS Hardware (but master nodes should be reliable)
Well-suited to bag-of-tasks applications (many bio apps)
Files are split into blocks and distributed across nodes
High-throughput access to huge datasets
WORM storage model
Hadoop: Architecture
[Diagram: a client sends a job request to the Job Tracker (the MR master); the Namenode (the HDFS master) manages the Datanodes (HDFS slaves); Task Trackers (MR slaves) run on the same nodes as the Datanodes]
Client sends the job request to the Job Tracker (JT)
JT queries the Namenode (NN) about data block locations
Input stream is split among n map tasks
Map tasks are scheduled closest to where data reside
Hadoop Distributed File System (HDFS)
[Diagram: to read file "a", a client asks the Namenode, which returns block IDs and datanode locations; the client then reads blocks a1, a2 and a3 directly from Datanodes spread across three racks; a Secondary Namenode sits beside the Namenode]
Secondary Namenode: namespace checkpointing
Helps the namenode with logs
Not a namenode replacement
Large blocks (∼100 MB)
Each block is replicated n times (3 by default)
One replica on the same rack, the others on different racks
You have to provide the network topology
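To put the numbers in perspective: with ∼100 MB blocks and the default replication factor of 3, a 1 TB file breaks into roughly 10,000 blocks and consumes about 3 TB of raw cluster storage.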
Wordcount: (part of) Java Code
public static class TokenizerMapper
    extends Mapper<Object, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, one);
    }
  }
}

public static class IntSumReducer
    extends Reducer<Text,IntWritable,Text,IntWritable> {

  private IntWritable result = new IntWritable();

  public void reduce(
      Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val: values) sum += val.get();
    result.set(sum);
    context.write(key, result);
  }
}
Other MapReduce Components
Combiner (local reducer)
InputFormat / RecordReader
Translates the byte-oriented view of input files into the record-oriented view required by the Mapper
Directly accesses HDFS files
Processing unit: input split
Partitioner
Decides which Reducer receives which key
Typically uses a hash function of the key
OutputFormat / RecordWriter
Writes key/value pairs output by the Reducer
Directly accesses HDFS files
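For word count, the reducer can double as the combiner, since integer addition is associative and commutative: each mapper’s (word, 1) pairs are pre-summed locally, greatly reducing the data moved during the shuffle.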
Input Formats and HDFS
[Diagram: an InputFormat carves the job’s HDFS input into input splits; by default, one input split corresponds to one HDFS block]
An HDFS block is an actual data chunk in an HDFS file
An input split is a logical representation of a subset of the MapReduce input dataset
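For example, with the default one-split-per-block policy, a 1 GB input file stored in 128 MB blocks yields 8 blocks and therefore 8 input splits, i.e., 8 map tasks.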
Hadoop 2
HDFS Federation
Multiple namenodes can share the burden of maintaining the namespace(s) and managing blocks
Better scalability
HDFS high availability (HA)
Can run redundant namenodes in active/passive mode
Allows failover in case of crash or planned maintenance
YARN
Resource manager: global assignment of compute resources to applications (HA supported since 2.4.0)
Application master(s): per-application scheduling and coordination (like a per-job JT, on a slave)
MapReduce is just one of many possible frameworks
Developing your First MapReduce Application
The easiest path for beginners is Hadoop Streaming
Java package included in Hadoop
Use any executable as the mapper or reducer
Read key-value pairs from standard input
Write them to standard output
Text protocol: records are serialized as <k>\t<v>\n
Word count mapper (Python)
#!/usr/bin/env python
import sys

for line in sys.stdin:
    for word in line.split():
        print "%s\t1" % word
Hadoop Streaming – Word Count Reducer (Python)
#!/usr/bin/env python
import sys

def serialize(key, value):
    return "%s\t%d" % (key, value)

def deserialize(line):
    key, value = line.split("\t", 1)
    return key, int(value)

def main():
    prev_key, out_value = None, 0
    for line in sys.stdin:
        key, value = deserialize(line)
        if key != prev_key:
            if prev_key is not None:
                print serialize(prev_key, out_value)
            out_value = 0
            prev_key = key
        out_value += value
    if prev_key is not None:  # guard against empty input
        print serialize(prev_key, out_value)

if __name__ == "__main__":
    main()
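Since the framework feeds each reducer its input sorted by key, the job can be approximated locally by emulating the shuffle & sort phase with sort (input.txt is a placeholder for your sample data):

cat input.txt | ./mapper.py | sort -k1,1 | ./reducer.py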
MapReduce Development with Hadoop
Java: native
C++: MapReduce API (via Hadoop Pipes) + libhdfs (included in the Hadoop distribution)
Python: several solutions, but do they meet all of the requirements of nontrivial apps?
Reuse existing modules, including extensions
NumPy / SciPy for numerical computation
Specialized components (RecordReader/Writer, Partitioner)
HDFS access
. . .
Python MR: Hadoop-Integrated Solutions
Hadoop Streaming
Awkward programming style
Can only write mapper and reducer scripts (no RecordReader, etc.)
No HDFS access
Jython
Incomplete standard library
Most third-party packages are only compatible with CPython
Cannot use C/C++ extension modules
Typically one or more releases behind CPython
Python MR: Third Party Solutions
Most third party packages are Hadoop Streaming wrappers
They add usability, but share many of Streaming’s limitations
Most are not actively developed anymore
Our solution: Pydoop (https://github.com/crs4/pydoop)
Actively developed since 2009
(Almost) pure Python Pipes client + libhdfs wrapper
Pydoop: Overview
Access to most MR components, including RecordReader, RecordWriter and Partitioner
Get configuration, set counters and report status
Event-driven programming model (like the Java API, but with the simplicity of Python)
You define classes, the MapReduce framework instantiates them and calls their methods
CPython → use any module
HDFS API
Summary of Features
                  | Streaming | Jython  | Pydoop
C/C++ Ext         | Yes       | No      | Yes
Standard Lib      | Full      | Partial | Full
MR API            | No*       | Full    | Partial
Event-driven prog | No        | Yes     | Yes
HDFS access       | No        | Yes     | Yes
(*) you can only write the map and reduce parts as executable scripts.
Hadoop Pipes
[Diagram: the Java Hadoop framework (PipesMapRunner, PipesReducer) communicates with C++ child processes over downward and upward protocols; the child processes host the record reader, mapper, partitioner, combiner, reducer and record writer]
App: separate process
Communication with the Java framework via persistent sockets
The C++ app provides a factory used by the framework to create MR components
Integration of Pydoop with Hadoop
[Diagram: Pydoop’s pure Python modules sit on top of a core extension, which handles (de)serialization and talks to Hadoop’s Java pipes framework, and an hdfs extension, which wraps the C libhdfs library (a JNI wrapper around the Java HDFS classes)]
Python Wordcount, Full Program Code
import pydoop.mapreduce.api as api
import pydoop.mapreduce.pipes as pp

class Mapper(api.Mapper):

    def map(self, context):
        words = context.value.split()
        for w in words:
            context.emit(w, 1)

class Reducer(api.Reducer):

    def reduce(self, context):
        s = sum(context.values)
        context.emit(context.key, s)

def __main__():
    pp.run_task(pp.Factory(Mapper, Reducer))
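Assuming the code above is saved as wc.py, a Pydoop 1.x job is typically launched with the pydoop submit tool; an invocation along the following lines (wc is the module name, input and output are placeholder HDFS paths) should work:

pydoop submit --upload-file-to-cache wc.py wc input output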
Status Reports and Counters
import pydoop.mapreduce.api as api

class Mapper(api.Mapper):

    def __init__(self, context):
        super(Mapper, self).__init__(context)
        context.set_status("initializing mapper")
        self.input_words = context.get_counter("WC", "INPUT_WORDS")

    def map(self, context):
        words = context.value.split()
        for w in words:
            context.emit(w, 1)
        context.increment_counter(self.input_words, len(words))
Status Reports and Counters: Web UI
Optional Components: Record Reader
import pydoop.mapreduce.api as api
import pydoop.hdfs as hdfs
from pydoop.utils.serialize import serialize_to_string as to_s

class Reader(api.RecordReader):

    def __init__(self, context):
        super(Reader, self).__init__(context)
        self.isplit = context.input_split
        self.file = hdfs.open(self.isplit.filename)
        self.file.seek(self.isplit.offset)
        self.bytes_read = 0
        if self.isplit.offset > 0:
            # not at the beginning of the file: discard the first,
            # possibly partial, line (the previous reader handles it)
            discarded = self.file.readline()
            self.bytes_read += len(discarded)

    def next(self):
        if self.bytes_read > self.isplit.length:
            raise StopIteration
        key = to_s(self.isplit.offset + self.bytes_read)
        record = self.file.readline()
        if record == "":  # end of file
            raise StopIteration
        self.bytes_read += len(record)
        return key, record
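Note the split-boundary logic: a reader whose split does not start at offset 0 discards its first (possibly partial) line, while next() stops only after bytes_read exceeds the split length, so a line straddling two splits is processed exactly once, by the reader that owns its first byte.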
Optional Components: Record Writer, Partitioner
import sys
import pydoop.hdfs as hdfs
import pydoop.mapreduce.api as api

class Writer(api.RecordWriter):

    def __init__(self, context):
        super(Writer, self).__init__(context)
        jc = context.job_conf
        part = jc.get_int("mapred.task.partition")
        out_dir = jc["mapred.work.output.dir"]
        outfn = "%s/part-%05d" % (out_dir, part)
        hdfs_user = jc.get("pydoop.hdfs.user", None)
        self.file = hdfs.open(outfn, "w", user=hdfs_user)
        self.sep = jc.get("mapred.textoutputformat.separator", "\t")

    def close(self):
        self.file.close()
        self.file.fs.close()

    def emit(self, key, value):
        self.file.write("%s%s%s\n" % (key, self.sep, value))

class Partitioner(api.Partitioner):

    def partition(self, key, n_red):
        # mask with sys.maxint to get a non-negative value
        reducer_id = (hash(key) & sys.maxint) % n_red
        return reducer_id
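This mirrors Hadoop’s default behavior: in Java, HashPartitioner assigns each key to (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks.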
The HDFS Module
>>> import pydoop.hdfs as hdfs
>>> hdfs.mkdir("test")
>>> hdfs.dump("hello", "test/hello.txt")
>>> text = hdfs.load("test/hello.txt")
>>> print text
hello
>>> hdfs.ls("test")
['hdfs://localhost:9000/user/simleo/test/hello.txt']
>>> hdfs.stat("test/hello.txt").st_size
5L
>>> hdfs.path.isdir("test")
True
>>> hdfs.cp("test", "test.copy")
>>> with hdfs.open("test/hello.txt", "r") as fi:
... print fi.read(3)
...
hel
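Note that relative paths are resolved against the current user’s HDFS home directory, hdfs://localhost:9000/user/simleo in the session above.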
HDFS Usage by Block Size
import pydoop.hdfs as hdfs

ROOT = "pydoop_test_tree"

def usage_by_bs(fs, root):
    # map each block size to the total number of bytes
    # stored in files that use it
    stats = {}
    for info in fs.walk(root):
        if info["kind"] == "directory":
            continue
        bs = int(info["block_size"])
        size = int(info["size"])
        stats[bs] = stats.get(bs, 0) + size
    return stats

if __name__ == "__main__":
    with hdfs.hdfs() as fs:
        print "BS (MB)\tBYTES USED"
        for k, v in sorted(usage_by_bs(fs, ROOT).iteritems()):
            print "%d\t%d" % (k / 2**20, v)
Thank you for your time!
Simone Leo Big Data & Hadoop

Weitere ähnliche Inhalte

Was ist angesagt?

Qiu bosc2010
Qiu bosc2010Qiu bosc2010
Qiu bosc2010BOSC 2010
 
Python in an Evolving Enterprise System (PyData SV 2013)
Python in an Evolving Enterprise System (PyData SV 2013)Python in an Evolving Enterprise System (PyData SV 2013)
Python in an Evolving Enterprise System (PyData SV 2013)PyData
 
Hadoop, MapReduce and R = RHadoop
Hadoop, MapReduce and R = RHadoopHadoop, MapReduce and R = RHadoop
Hadoop, MapReduce and R = RHadoopVictoria López
 
Analyzing Big data in R and Scala using Apache Spark 17-7-19
Analyzing Big data in R and Scala using Apache Spark  17-7-19Analyzing Big data in R and Scala using Apache Spark  17-7-19
Analyzing Big data in R and Scala using Apache Spark 17-7-19Ahmed Elsayed
 
A Big-Data Process Consigned Geographically by Employing Mapreduce Frame Work
A Big-Data Process Consigned Geographically by Employing Mapreduce Frame WorkA Big-Data Process Consigned Geographically by Employing Mapreduce Frame Work
A Big-Data Process Consigned Geographically by Employing Mapreduce Frame WorkIRJET Journal
 
Goto amsterdam-2013-skinned
Goto amsterdam-2013-skinnedGoto amsterdam-2013-skinned
Goto amsterdam-2013-skinnedTed Dunning
 
Storm users group real time hadoop
Storm users group real time hadoopStorm users group real time hadoop
Storm users group real time hadoopTed Dunning
 
Power of Python with Big Data
Power of Python with Big DataPower of Python with Big Data
Power of Python with Big DataEdureka!
 
Berlin Hadoop Get Together Apache Drill
Berlin Hadoop Get Together Apache Drill Berlin Hadoop Get Together Apache Drill
Berlin Hadoop Get Together Apache Drill MapR Technologies
 
Introduction to Hadoop part 2
Introduction to Hadoop part 2Introduction to Hadoop part 2
Introduction to Hadoop part 2Giovanna Roda
 
What is Hadoop?
What is Hadoop?What is Hadoop?
What is Hadoop?cneudecker
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big DataAlbert Bifet
 
Approximation algorithms for stream and batch processing
Approximation algorithms for stream and batch processingApproximation algorithms for stream and batch processing
Approximation algorithms for stream and batch processingGabriele Modena
 
Full stack analytics with Hadoop 2
Full stack analytics with Hadoop 2Full stack analytics with Hadoop 2
Full stack analytics with Hadoop 2Gabriele Modena
 
International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)IJERD Editor
 
High Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and HadoopHigh Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and HadoopRevolution Analytics
 
Beyond Hadoop 1.0: A Holistic View of Hadoop YARN, Spark and GraphLab
Beyond Hadoop 1.0: A Holistic View of Hadoop YARN, Spark and GraphLabBeyond Hadoop 1.0: A Holistic View of Hadoop YARN, Spark and GraphLab
Beyond Hadoop 1.0: A Holistic View of Hadoop YARN, Spark and GraphLabVijay Srinivas Agneeswaran, Ph.D
 

Was ist angesagt? (20)

Qiu bosc2010
Qiu bosc2010Qiu bosc2010
Qiu bosc2010
 
Python in an Evolving Enterprise System (PyData SV 2013)
Python in an Evolving Enterprise System (PyData SV 2013)Python in an Evolving Enterprise System (PyData SV 2013)
Python in an Evolving Enterprise System (PyData SV 2013)
 
Hadoop, MapReduce and R = RHadoop
Hadoop, MapReduce and R = RHadoopHadoop, MapReduce and R = RHadoop
Hadoop, MapReduce and R = RHadoop
 
Analyzing Big data in R and Scala using Apache Spark 17-7-19
Analyzing Big data in R and Scala using Apache Spark  17-7-19Analyzing Big data in R and Scala using Apache Spark  17-7-19
Analyzing Big data in R and Scala using Apache Spark 17-7-19
 
A Big-Data Process Consigned Geographically by Employing Mapreduce Frame Work
A Big-Data Process Consigned Geographically by Employing Mapreduce Frame WorkA Big-Data Process Consigned Geographically by Employing Mapreduce Frame Work
A Big-Data Process Consigned Geographically by Employing Mapreduce Frame Work
 
Goto amsterdam-2013-skinned
Goto amsterdam-2013-skinnedGoto amsterdam-2013-skinned
Goto amsterdam-2013-skinned
 
Pig Experience
Pig ExperiencePig Experience
Pig Experience
 
Storm users group real time hadoop
Storm users group real time hadoopStorm users group real time hadoop
Storm users group real time hadoop
 
Power of Python with Big Data
Power of Python with Big DataPower of Python with Big Data
Power of Python with Big Data
 
Berlin Hadoop Get Together Apache Drill
Berlin Hadoop Get Together Apache Drill Berlin Hadoop Get Together Apache Drill
Berlin Hadoop Get Together Apache Drill
 
Introduction to Hadoop part 2
Introduction to Hadoop part 2Introduction to Hadoop part 2
Introduction to Hadoop part 2
 
What is Hadoop?
What is Hadoop?What is Hadoop?
What is Hadoop?
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big Data
 
Approximation algorithms for stream and batch processing
Approximation algorithms for stream and batch processingApproximation algorithms for stream and batch processing
Approximation algorithms for stream and batch processing
 
Full stack analytics with Hadoop 2
Full stack analytics with Hadoop 2Full stack analytics with Hadoop 2
Full stack analytics with Hadoop 2
 
International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)
 
High Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and HadoopHigh Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and Hadoop
 
Eg4301808811
Eg4301808811Eg4301808811
Eg4301808811
 
Beyond Hadoop 1.0: A Holistic View of Hadoop YARN, Spark and GraphLab
Beyond Hadoop 1.0: A Holistic View of Hadoop YARN, Spark and GraphLabBeyond Hadoop 1.0: A Holistic View of Hadoop YARN, Spark and GraphLab
Beyond Hadoop 1.0: A Holistic View of Hadoop YARN, Spark and GraphLab
 
London hug
London hugLondon hug
London hug
 

Ähnlich wie Big Data & Hadoop. Simone Leo (CRS4)

Intro to BigData , Hadoop and Mapreduce
Intro to BigData , Hadoop and MapreduceIntro to BigData , Hadoop and Mapreduce
Intro to BigData , Hadoop and MapreduceKrishna Sangeeth KS
 
IRJET - Survey Paper on Map Reduce Processing using HADOOP
IRJET - Survey Paper on Map Reduce Processing using HADOOPIRJET - Survey Paper on Map Reduce Processing using HADOOP
IRJET - Survey Paper on Map Reduce Processing using HADOOPIRJET Journal
 
Hadoop @ Sara & BiG Grid
Hadoop @ Sara & BiG GridHadoop @ Sara & BiG Grid
Hadoop @ Sara & BiG GridEvert Lammerts
 
Python in big data world
Python in big data worldPython in big data world
Python in big data worldRohit
 
Stratosphere with big_data_analytics
Stratosphere with big_data_analyticsStratosphere with big_data_analytics
Stratosphere with big_data_analyticsAvinash Pandu
 
A Survey on Big Data Analysis Techniques
A Survey on Big Data Analysis TechniquesA Survey on Big Data Analysis Techniques
A Survey on Big Data Analysis Techniquesijsrd.com
 
Building a Big Data platform with the Hadoop ecosystem
Building a Big Data platform with the Hadoop ecosystemBuilding a Big Data platform with the Hadoop ecosystem
Building a Big Data platform with the Hadoop ecosystemGregg Barrett
 
Big data with hadoop
Big data with hadoopBig data with hadoop
Big data with hadoopAnusha sweety
 
Big-Data Hadoop Tutorials - MindScripts Technologies, Pune
Big-Data Hadoop Tutorials - MindScripts Technologies, Pune Big-Data Hadoop Tutorials - MindScripts Technologies, Pune
Big-Data Hadoop Tutorials - MindScripts Technologies, Pune amrutupre
 
THE SOLUTION FOR BIG DATA
THE SOLUTION FOR BIG DATATHE SOLUTION FOR BIG DATA
THE SOLUTION FOR BIG DATATarak Tar
 
THE SOLUTION FOR BIG DATA
THE SOLUTION FOR BIG DATATHE SOLUTION FOR BIG DATA
THE SOLUTION FOR BIG DATATarak Tar
 
Hadoop and big data training
Hadoop and big data trainingHadoop and big data training
Hadoop and big data trainingagiamas
 

Ähnlich wie Big Data & Hadoop. Simone Leo (CRS4) (20)

Intro to BigData , Hadoop and Mapreduce
Intro to BigData , Hadoop and MapreduceIntro to BigData , Hadoop and Mapreduce
Intro to BigData , Hadoop and Mapreduce
 
IRJET - Survey Paper on Map Reduce Processing using HADOOP
IRJET - Survey Paper on Map Reduce Processing using HADOOPIRJET - Survey Paper on Map Reduce Processing using HADOOP
IRJET - Survey Paper on Map Reduce Processing using HADOOP
 
Hadoop @ Sara & BiG Grid
Hadoop @ Sara & BiG GridHadoop @ Sara & BiG Grid
Hadoop @ Sara & BiG Grid
 
Big data
Big dataBig data
Big data
 
Big data
Big dataBig data
Big data
 
BIG DATA
BIG DATABIG DATA
BIG DATA
 
Python in big data world
Python in big data worldPython in big data world
Python in big data world
 
B04 06 0918
B04 06 0918B04 06 0918
B04 06 0918
 
Stratosphere with big_data_analytics
Stratosphere with big_data_analyticsStratosphere with big_data_analytics
Stratosphere with big_data_analytics
 
A Survey on Big Data Analysis Techniques
A Survey on Big Data Analysis TechniquesA Survey on Big Data Analysis Techniques
A Survey on Big Data Analysis Techniques
 
Hadoop Cluster Analysis and Assessment
Hadoop Cluster Analysis and AssessmentHadoop Cluster Analysis and Assessment
Hadoop Cluster Analysis and Assessment
 
Building a Big Data platform with the Hadoop ecosystem
Building a Big Data platform with the Hadoop ecosystemBuilding a Big Data platform with the Hadoop ecosystem
Building a Big Data platform with the Hadoop ecosystem
 
Big data with hadoop
Big data with hadoopBig data with hadoop
Big data with hadoop
 
IJET-V2I6P25
IJET-V2I6P25IJET-V2I6P25
IJET-V2I6P25
 
B04 06 0918
B04 06 0918B04 06 0918
B04 06 0918
 
Big-Data Hadoop Tutorials - MindScripts Technologies, Pune
Big-Data Hadoop Tutorials - MindScripts Technologies, Pune Big-Data Hadoop Tutorials - MindScripts Technologies, Pune
Big-Data Hadoop Tutorials - MindScripts Technologies, Pune
 
THE SOLUTION FOR BIG DATA
THE SOLUTION FOR BIG DATATHE SOLUTION FOR BIG DATA
THE SOLUTION FOR BIG DATA
 
THE SOLUTION FOR BIG DATA
THE SOLUTION FOR BIG DATATHE SOLUTION FOR BIG DATA
THE SOLUTION FOR BIG DATA
 
Big data
Big dataBig data
Big data
 
Hadoop and big data training
Hadoop and big data trainingHadoop and big data training
Hadoop and big data training
 

Mehr von CRS4 Research Center in Sardinia

Sequenziamento Esomico. Maria Valentini (CRS4), Cagliari, 18 Novembre 2015
Sequenziamento Esomico. Maria Valentini (CRS4), Cagliari, 18 Novembre 2015Sequenziamento Esomico. Maria Valentini (CRS4), Cagliari, 18 Novembre 2015
Sequenziamento Esomico. Maria Valentini (CRS4), Cagliari, 18 Novembre 2015CRS4 Research Center in Sardinia
 
Near Surface Geoscience Conference 2015, Turin - A Spatial Velocity Analysis ...
Near Surface Geoscience Conference 2015, Turin - A Spatial Velocity Analysis ...Near Surface Geoscience Conference 2015, Turin - A Spatial Velocity Analysis ...
Near Surface Geoscience Conference 2015, Turin - A Spatial Velocity Analysis ...CRS4 Research Center in Sardinia
 
GIS partecipativo. Laura Muscas e Valentina Spanu (CRS4), Cagliari, 21 Ottobr...
GIS partecipativo. Laura Muscas e Valentina Spanu (CRS4), Cagliari, 21 Ottobr...GIS partecipativo. Laura Muscas e Valentina Spanu (CRS4), Cagliari, 21 Ottobr...
GIS partecipativo. Laura Muscas e Valentina Spanu (CRS4), Cagliari, 21 Ottobr...CRS4 Research Center in Sardinia
 
Alfonso Damiano (Università di Cagliari) ICT per Smart Grid
Alfonso Damiano (Università di Cagliari) ICT per Smart Grid Alfonso Damiano (Università di Cagliari) ICT per Smart Grid
Alfonso Damiano (Università di Cagliari) ICT per Smart Grid CRS4 Research Center in Sardinia
 
Dinamica Molecolare e Modellistica dell'interazione di lipidi col recettore P...
Dinamica Molecolare e Modellistica dell'interazione di lipidi col recettore P...Dinamica Molecolare e Modellistica dell'interazione di lipidi col recettore P...
Dinamica Molecolare e Modellistica dell'interazione di lipidi col recettore P...CRS4 Research Center in Sardinia
 
Innovazione e infrastrutture cloud per lo sviluppo di applicativi web e mobil...
Innovazione e infrastrutture cloud per lo sviluppo di applicativi web e mobil...Innovazione e infrastrutture cloud per lo sviluppo di applicativi web e mobil...
Innovazione e infrastrutture cloud per lo sviluppo di applicativi web e mobil...CRS4 Research Center in Sardinia
 
ORDBMS e NoSQL nel trattamento dei dati geografici parte seconda. 30 Sett. 2015
ORDBMS e NoSQL nel trattamento dei dati geografici parte seconda. 30 Sett. 2015ORDBMS e NoSQL nel trattamento dei dati geografici parte seconda. 30 Sett. 2015
ORDBMS e NoSQL nel trattamento dei dati geografici parte seconda. 30 Sett. 2015CRS4 Research Center in Sardinia
 
Sistemi No-Sql e Object-Relational nella gestione dei dati geografici 30 Sett...
Sistemi No-Sql e Object-Relational nella gestione dei dati geografici 30 Sett...Sistemi No-Sql e Object-Relational nella gestione dei dati geografici 30 Sett...
Sistemi No-Sql e Object-Relational nella gestione dei dati geografici 30 Sett...CRS4 Research Center in Sardinia
 
Elementi di sismica a riflessione e Georadar (Gian Piero Deidda, UNICA)
Elementi di sismica a riflessione e Georadar (Gian Piero Deidda, UNICA)Elementi di sismica a riflessione e Georadar (Gian Piero Deidda, UNICA)
Elementi di sismica a riflessione e Georadar (Gian Piero Deidda, UNICA)CRS4 Research Center in Sardinia
 
Near Surface Geoscience Conference 2014, Athens - Real-­time or full­‐precisi...
Near Surface Geoscience Conference 2014, Athens - Real-­time or full­‐precisi...Near Surface Geoscience Conference 2014, Athens - Real-­time or full­‐precisi...
Near Surface Geoscience Conference 2014, Athens - Real-­time or full­‐precisi...CRS4 Research Center in Sardinia
 
Luigi Atzori Metabolomica: Introduzione e review di alcune applicazioni in am...
Luigi Atzori Metabolomica: Introduzione e review di alcune applicazioni in am...Luigi Atzori Metabolomica: Introduzione e review di alcune applicazioni in am...
Luigi Atzori Metabolomica: Introduzione e review di alcune applicazioni in am...CRS4 Research Center in Sardinia
 

Mehr von CRS4 Research Center in Sardinia (20)

The future is close
The future is closeThe future is close
The future is close
 
The future is close
The future is closeThe future is close
The future is close
 
Presentazione Linea B2 progetto Tutti a Iscol@ 2017
Presentazione Linea B2 progetto Tutti a Iscol@ 2017Presentazione Linea B2 progetto Tutti a Iscol@ 2017
Presentazione Linea B2 progetto Tutti a Iscol@ 2017
 
Iscola linea B 2016
Iscola linea B 2016Iscola linea B 2016
Iscola linea B 2016
 
Sequenziamento Esomico. Maria Valentini (CRS4), Cagliari, 18 Novembre 2015
Sequenziamento Esomico. Maria Valentini (CRS4), Cagliari, 18 Novembre 2015Sequenziamento Esomico. Maria Valentini (CRS4), Cagliari, 18 Novembre 2015
Sequenziamento Esomico. Maria Valentini (CRS4), Cagliari, 18 Novembre 2015
 
Near Surface Geoscience Conference 2015, Turin - A Spatial Velocity Analysis ...
Near Surface Geoscience Conference 2015, Turin - A Spatial Velocity Analysis ...Near Surface Geoscience Conference 2015, Turin - A Spatial Velocity Analysis ...
Near Surface Geoscience Conference 2015, Turin - A Spatial Velocity Analysis ...
 
GIS partecipativo. Laura Muscas e Valentina Spanu (CRS4), Cagliari, 21 Ottobr...
GIS partecipativo. Laura Muscas e Valentina Spanu (CRS4), Cagliari, 21 Ottobr...GIS partecipativo. Laura Muscas e Valentina Spanu (CRS4), Cagliari, 21 Ottobr...
GIS partecipativo. Laura Muscas e Valentina Spanu (CRS4), Cagliari, 21 Ottobr...
 
Alfonso Damiano (Università di Cagliari) ICT per Smart Grid
Alfonso Damiano (Università di Cagliari) ICT per Smart Grid Alfonso Damiano (Università di Cagliari) ICT per Smart Grid
Alfonso Damiano (Università di Cagliari) ICT per Smart Grid
 
Big Data Infrastructures - Hadoop ecosystem, M. E. Piras
Big Data Infrastructures - Hadoop ecosystem, M. E. PirasBig Data Infrastructures - Hadoop ecosystem, M. E. Piras
Big Data Infrastructures - Hadoop ecosystem, M. E. Piras
 
Big Data Analytics, Giovanni Delussu e Marco Enrico Piras
 Big Data Analytics, Giovanni Delussu e Marco Enrico Piras  Big Data Analytics, Giovanni Delussu e Marco Enrico Piras
Big Data Analytics, Giovanni Delussu e Marco Enrico Piras
 
Dinamica Molecolare e Modellistica dell'interazione di lipidi col recettore P...
Dinamica Molecolare e Modellistica dell'interazione di lipidi col recettore P...Dinamica Molecolare e Modellistica dell'interazione di lipidi col recettore P...
Dinamica Molecolare e Modellistica dell'interazione di lipidi col recettore P...
 
Innovazione e infrastrutture cloud per lo sviluppo di applicativi web e mobil...
Innovazione e infrastrutture cloud per lo sviluppo di applicativi web e mobil...Innovazione e infrastrutture cloud per lo sviluppo di applicativi web e mobil...
Innovazione e infrastrutture cloud per lo sviluppo di applicativi web e mobil...
 
ORDBMS e NoSQL nel trattamento dei dati geografici parte seconda. 30 Sett. 2015
ORDBMS e NoSQL nel trattamento dei dati geografici parte seconda. 30 Sett. 2015ORDBMS e NoSQL nel trattamento dei dati geografici parte seconda. 30 Sett. 2015
ORDBMS e NoSQL nel trattamento dei dati geografici parte seconda. 30 Sett. 2015
 
Sistemi No-Sql e Object-Relational nella gestione dei dati geografici 30 Sett...
Sistemi No-Sql e Object-Relational nella gestione dei dati geografici 30 Sett...Sistemi No-Sql e Object-Relational nella gestione dei dati geografici 30 Sett...
Sistemi No-Sql e Object-Relational nella gestione dei dati geografici 30 Sett...
 
Elementi di sismica a riflessione e Georadar (Gian Piero Deidda, UNICA)
Elementi di sismica a riflessione e Georadar (Gian Piero Deidda, UNICA)Elementi di sismica a riflessione e Georadar (Gian Piero Deidda, UNICA)
Elementi di sismica a riflessione e Georadar (Gian Piero Deidda, UNICA)
 
Near Surface Geoscience Conference 2014, Athens - Real-­time or full­‐precisi...
Near Surface Geoscience Conference 2014, Athens - Real-­time or full­‐precisi...Near Surface Geoscience Conference 2014, Athens - Real-­time or full­‐precisi...
Near Surface Geoscience Conference 2014, Athens - Real-­time or full­‐precisi...
 
SmartGeo/Eiagrid portal (Guido Satta, CRS4)
SmartGeo/Eiagrid portal (Guido Satta, CRS4)SmartGeo/Eiagrid portal (Guido Satta, CRS4)
SmartGeo/Eiagrid portal (Guido Satta, CRS4)
 
Luigi Atzori Metabolomica: Introduzione e review di alcune applicazioni in am...
Luigi Atzori Metabolomica: Introduzione e review di alcune applicazioni in am...Luigi Atzori Metabolomica: Introduzione e review di alcune applicazioni in am...
Luigi Atzori Metabolomica: Introduzione e review di alcune applicazioni in am...
 
Mobile Graphics (part2)
Mobile Graphics (part2)Mobile Graphics (part2)
Mobile Graphics (part2)
 
Mobile Graphics (part1)
Mobile Graphics (part1)Mobile Graphics (part1)
Mobile Graphics (part1)
 

Kürzlich hochgeladen

6.2 Pests of Sesame_Identification_Binomics_Dr.UPR
6.2 Pests of Sesame_Identification_Binomics_Dr.UPR6.2 Pests of Sesame_Identification_Binomics_Dr.UPR
6.2 Pests of Sesame_Identification_Binomics_Dr.UPRPirithiRaju
 
Measures of Central Tendency.pptx for UG
Measures of Central Tendency.pptx for UGMeasures of Central Tendency.pptx for UG
Measures of Central Tendency.pptx for UGSoniaBajaj10
 
Timeless Cosmology: Towards a Geometric Origin of Cosmological Correlations
Timeless Cosmology: Towards a Geometric Origin of Cosmological CorrelationsTimeless Cosmology: Towards a Geometric Origin of Cosmological Correlations
Timeless Cosmology: Towards a Geometric Origin of Cosmological CorrelationsDanielBaumann11
 
ESSENTIAL FEATURES REQUIRED FOR ESTABLISHING FOUR TYPES OF BIOSAFETY LABORATO...
ESSENTIAL FEATURES REQUIRED FOR ESTABLISHING FOUR TYPES OF BIOSAFETY LABORATO...ESSENTIAL FEATURES REQUIRED FOR ESTABLISHING FOUR TYPES OF BIOSAFETY LABORATO...
ESSENTIAL FEATURES REQUIRED FOR ESTABLISHING FOUR TYPES OF BIOSAFETY LABORATO...Chayanika Das
 
Pests of Sunflower_Binomics_Identification_Dr.UPR
Pests of Sunflower_Binomics_Identification_Dr.UPRPests of Sunflower_Binomics_Identification_Dr.UPR
Pests of Sunflower_Binomics_Identification_Dr.UPRPirithiRaju
 
Science (Communication) and Wikipedia - Potentials and Pitfalls
Science (Communication) and Wikipedia - Potentials and PitfallsScience (Communication) and Wikipedia - Potentials and Pitfalls
Science (Communication) and Wikipedia - Potentials and PitfallsDobusch Leonhard
 
DNA isolation molecular biology practical.pptx
DNA isolation molecular biology practical.pptxDNA isolation molecular biology practical.pptx
DNA isolation molecular biology practical.pptxGiDMOh
 
Combining Asynchronous Task Parallelism and Intel SGX for Secure Deep Learning
Combining Asynchronous Task Parallelism and Intel SGX for Secure Deep LearningCombining Asynchronous Task Parallelism and Intel SGX for Secure Deep Learning
Combining Asynchronous Task Parallelism and Intel SGX for Secure Deep Learningvschiavoni
 
Charateristics of the Angara-A5 spacecraft launched from the Vostochny Cosmod...
Charateristics of the Angara-A5 spacecraft launched from the Vostochny Cosmod...Charateristics of the Angara-A5 spacecraft launched from the Vostochny Cosmod...
Charateristics of the Angara-A5 spacecraft launched from the Vostochny Cosmod...Christina Parmionova
 
CHROMATOGRAPHY PALLAVI RAWAT.pptx
CHROMATOGRAPHY  PALLAVI RAWAT.pptxCHROMATOGRAPHY  PALLAVI RAWAT.pptx
CHROMATOGRAPHY PALLAVI RAWAT.pptxpallavirawat456
 
6.1 Pests of Groundnut_Binomics_Identification_Dr.UPR
6.1 Pests of Groundnut_Binomics_Identification_Dr.UPR6.1 Pests of Groundnut_Binomics_Identification_Dr.UPR
6.1 Pests of Groundnut_Binomics_Identification_Dr.UPRPirithiRaju
 
GLYCOSIDES Classification Of GLYCOSIDES Chemical Tests Glycosides
GLYCOSIDES Classification Of GLYCOSIDES  Chemical Tests GlycosidesGLYCOSIDES Classification Of GLYCOSIDES  Chemical Tests Glycosides
GLYCOSIDES Classification Of GLYCOSIDES Chemical Tests GlycosidesNandakishor Bhaurao Deshmukh
 
Oxo-Acids of Halogens and their Salts.pptx
Oxo-Acids of Halogens and their Salts.pptxOxo-Acids of Halogens and their Salts.pptx
Oxo-Acids of Halogens and their Salts.pptxfarhanvvdk
 
Total Legal: A “Joint” Journey into the Chemistry of Cannabinoids
Total Legal: A “Joint” Journey into the Chemistry of CannabinoidsTotal Legal: A “Joint” Journey into the Chemistry of Cannabinoids
Total Legal: A “Joint” Journey into the Chemistry of CannabinoidsMarkus Roggen
 
KDIGO-2023-CKD-Guideline-Public-Review-Draft_5-July-2023.pdf
KDIGO-2023-CKD-Guideline-Public-Review-Draft_5-July-2023.pdfKDIGO-2023-CKD-Guideline-Public-Review-Draft_5-July-2023.pdf
KDIGO-2023-CKD-Guideline-Public-Review-Draft_5-July-2023.pdfGABYFIORELAMALPARTID1
 
dll general biology week 1 - Copy.docx
dll general biology   week 1 - Copy.docxdll general biology   week 1 - Copy.docx
dll general biology week 1 - Copy.docxkarenmillo
 
DETECTION OF MUTATION BY CLB METHOD.pptx
DETECTION OF MUTATION BY CLB METHOD.pptxDETECTION OF MUTATION BY CLB METHOD.pptx
DETECTION OF MUTATION BY CLB METHOD.pptx201bo007
 
Abnormal LFTs rate of deco and NAFLD.pptx
Abnormal LFTs rate of deco and NAFLD.pptxAbnormal LFTs rate of deco and NAFLD.pptx
Abnormal LFTs rate of deco and NAFLD.pptxzeus70441
 

Kürzlich hochgeladen (20)

6.2 Pests of Sesame_Identification_Binomics_Dr.UPR
6.2 Pests of Sesame_Identification_Binomics_Dr.UPR6.2 Pests of Sesame_Identification_Binomics_Dr.UPR
6.2 Pests of Sesame_Identification_Binomics_Dr.UPR
 
Measures of Central Tendency.pptx for UG
Measures of Central Tendency.pptx for UGMeasures of Central Tendency.pptx for UG
Measures of Central Tendency.pptx for UG
 
Timeless Cosmology: Towards a Geometric Origin of Cosmological Correlations
Timeless Cosmology: Towards a Geometric Origin of Cosmological CorrelationsTimeless Cosmology: Towards a Geometric Origin of Cosmological Correlations
Timeless Cosmology: Towards a Geometric Origin of Cosmological Correlations
 
ESSENTIAL FEATURES REQUIRED FOR ESTABLISHING FOUR TYPES OF BIOSAFETY LABORATO...
ESSENTIAL FEATURES REQUIRED FOR ESTABLISHING FOUR TYPES OF BIOSAFETY LABORATO...ESSENTIAL FEATURES REQUIRED FOR ESTABLISHING FOUR TYPES OF BIOSAFETY LABORATO...
ESSENTIAL FEATURES REQUIRED FOR ESTABLISHING FOUR TYPES OF BIOSAFETY LABORATO...
 
Pests of Sunflower_Binomics_Identification_Dr.UPR
Pests of Sunflower_Binomics_Identification_Dr.UPRPests of Sunflower_Binomics_Identification_Dr.UPR
Pests of Sunflower_Binomics_Identification_Dr.UPR
 
Science (Communication) and Wikipedia - Potentials and Pitfalls
Science (Communication) and Wikipedia - Potentials and PitfallsScience (Communication) and Wikipedia - Potentials and Pitfalls
Science (Communication) and Wikipedia - Potentials and Pitfalls
 
DNA isolation molecular biology practical.pptx
DNA isolation molecular biology practical.pptxDNA isolation molecular biology practical.pptx
DNA isolation molecular biology practical.pptx
 
Combining Asynchronous Task Parallelism and Intel SGX for Secure Deep Learning
Combining Asynchronous Task Parallelism and Intel SGX for Secure Deep LearningCombining Asynchronous Task Parallelism and Intel SGX for Secure Deep Learning
Combining Asynchronous Task Parallelism and Intel SGX for Secure Deep Learning
 
Charateristics of the Angara-A5 spacecraft launched from the Vostochny Cosmod...
Charateristics of the Angara-A5 spacecraft launched from the Vostochny Cosmod...Charateristics of the Angara-A5 spacecraft launched from the Vostochny Cosmod...
Charateristics of the Angara-A5 spacecraft launched from the Vostochny Cosmod...
 
CHROMATOGRAPHY PALLAVI RAWAT.pptx
CHROMATOGRAPHY  PALLAVI RAWAT.pptxCHROMATOGRAPHY  PALLAVI RAWAT.pptx
CHROMATOGRAPHY PALLAVI RAWAT.pptx
 
6.1 Pests of Groundnut_Binomics_Identification_Dr.UPR
6.1 Pests of Groundnut_Binomics_Identification_Dr.UPR6.1 Pests of Groundnut_Binomics_Identification_Dr.UPR
6.1 Pests of Groundnut_Binomics_Identification_Dr.UPR
 
GLYCOSIDES Classification Of GLYCOSIDES Chemical Tests Glycosides
GLYCOSIDES Classification Of GLYCOSIDES  Chemical Tests GlycosidesGLYCOSIDES Classification Of GLYCOSIDES  Chemical Tests Glycosides
GLYCOSIDES Classification Of GLYCOSIDES Chemical Tests Glycosides
 
Oxo-Acids of Halogens and their Salts.pptx
Oxo-Acids of Halogens and their Salts.pptxOxo-Acids of Halogens and their Salts.pptx
Oxo-Acids of Halogens and their Salts.pptx
 
Total Legal: A “Joint” Journey into the Chemistry of Cannabinoids
Total Legal: A “Joint” Journey into the Chemistry of CannabinoidsTotal Legal: A “Joint” Journey into the Chemistry of Cannabinoids
Total Legal: A “Joint” Journey into the Chemistry of Cannabinoids
 
AZOTOBACTER AS BIOFERILIZER.PPTX
AZOTOBACTER AS BIOFERILIZER.PPTXAZOTOBACTER AS BIOFERILIZER.PPTX
AZOTOBACTER AS BIOFERILIZER.PPTX
 
KDIGO-2023-CKD-Guideline-Public-Review-Draft_5-July-2023.pdf
KDIGO-2023-CKD-Guideline-Public-Review-Draft_5-July-2023.pdfKDIGO-2023-CKD-Guideline-Public-Review-Draft_5-July-2023.pdf
KDIGO-2023-CKD-Guideline-Public-Review-Draft_5-July-2023.pdf
 
dll general biology week 1 - Copy.docx
dll general biology   week 1 - Copy.docxdll general biology   week 1 - Copy.docx
dll general biology week 1 - Copy.docx
 
DETECTION OF MUTATION BY CLB METHOD.pptx
DETECTION OF MUTATION BY CLB METHOD.pptxDETECTION OF MUTATION BY CLB METHOD.pptx
DETECTION OF MUTATION BY CLB METHOD.pptx
 
Interferons.pptx.
Interferons.pptx.Interferons.pptx.
Interferons.pptx.
 
Abnormal LFTs rate of deco and NAFLD.pptx
Abnormal LFTs rate of deco and NAFLD.pptxAbnormal LFTs rate of deco and NAFLD.pptx
Abnormal LFTs rate of deco and NAFLD.pptx
 

Big Data & Hadoop. Simone Leo (CRS4)

  • 1. Big Data MapReduce and Hadoop Pydoop Big Data & Hadoop Simone Leo Distributed Computing – CRS4 http://www.crs4.it Seminar Series 2015 Simone Leo Big Data & Hadoop
  • 2. Big Data MapReduce and Hadoop Pydoop Outline 1 Big Data 2 MapReduce and Hadoop The MapReduce Model Hadoop 3 Pydoop Motivation Architecture Usage Simone Leo Big Data & Hadoop
  • 3. Big Data MapReduce and Hadoop Pydoop Outline 1 Big Data 2 MapReduce and Hadoop The MapReduce Model Hadoop 3 Pydoop Motivation Architecture Usage Simone Leo Big Data & Hadoop
  • 4. Big Data MapReduce and Hadoop Pydoop The Data Deluge sciencematters.berkeley.edu satellites upload.wikimedia.org social networks www.imagenes-bio.de DNA sequencing scienceblogs.com high-energy physics Storage costs rapidly decreasing Easier to acquire huge amounts of data Social networks Particle colliders Radio telescopes DNA sequencers Wearable sensors . . . Simone Leo Big Data & Hadoop
  • 5. Big Data MapReduce and Hadoop Pydoop The Three Vs: Volume http://www.ibmbigdatahub.com/infographic/four-vs-big-data Simone Leo Big Data & Hadoop
  • 6. Big Data MapReduce and Hadoop Pydoop The Three Vs: Velocity http://www.ibmbigdatahub.com/infographic/four-vs-big-data Simone Leo Big Data & Hadoop
  • 7. Big Data MapReduce and Hadoop Pydoop The Three Vs: Variety http://www.ibmbigdatahub.com/infographic/four-vs-big-data Simone Leo Big Data & Hadoop
  • 8. Big Data MapReduce and Hadoop Pydoop Big Data: a Hard to Define Buzzword OED: “data of a very large size, typically to the extent that its manipulation and management present significant logistical challenges”1 Wikipedia: “data sets so large or complex that traditional data processing applications are inadequate” datascience@berkeley blog: 43 definitions2 How big is big? Depends on the field and organization Varies with time [Volume] does not fit in memory / in a hard disk [Velocity] production faster than processing time [Variety] unknown / unsupported formats . . . 1 http://www.forbes.com/sites/gilpress/2013/06/18/big-data-news-a-revolution-indeed 2 http://datascience.berkeley.edu/what-is-big-data Simone Leo Big Data & Hadoop
  • 9. Big Data MapReduce and Hadoop Pydoop The Free Lunch is Over CPU clock speed reached saturation around 2004 Multi-core architectures Everyone must go parallel Moore’s law reinterpreted number of cores per chip doubles every 2 y clock speed remains fixed or decreases http://www.gotw.ca/publications/concurrency-ddj.htm Simone Leo Big Data & Hadoop
  • 10. Big Data MapReduce and Hadoop Pydoop Data-Intensive Applications https://imgflip.com/memegenerator A new class of problems Needs specific solutions Infrastructure Technologies Professional background Scalability Automatically adapt to changes in size Distributed computing Simone Leo Big Data & Hadoop
  • 11. Big Data MapReduce and Hadoop Pydoop The MapReduce Model Hadoop Outline 1 Big Data 2 MapReduce and Hadoop The MapReduce Model Hadoop 3 Pydoop Motivation Architecture Usage Simone Leo Big Data & Hadoop
  • 12. Big Data MapReduce and Hadoop Pydoop The MapReduce Model Hadoop Outline 1 Big Data 2 MapReduce and Hadoop The MapReduce Model Hadoop 3 Pydoop Motivation Architecture Usage Simone Leo Big Data & Hadoop
  • 13. Big Data MapReduce and Hadoop Pydoop The MapReduce Model Hadoop What is MapReduce? A programming model for large-scale distributed data processing Inspired by map and reduce in functional programming Map: map a set of input key/value pairs to a set of intermediate key/value pairs Reduce: apply a function to all values associated to the same intermediate key; emit output key/value pairs An implementation of a system to execute such programs Fault-tolerant Hides internals from users Scales very well with dataset size Simone Leo Big Data & Hadoop
  • 14. Big Data MapReduce and Hadoop Pydoop The MapReduce Model Hadoop MapReduce’s Hello World: Word Count the quick brown fox ate the lazy green fox Mapper Map Reducer ReduceReducer Shuffle & Sort the, 1 quick, 1 brown, 1 fox, 1 ate, 1 lazy, 1fox, 1 green, 1 quick, 1 brown, 1 fox, 2 ate, 1 the, 2 lazy, 1 green, 1 the, 1 Mapper Mapper Simone Leo Big Data & Hadoop
  • 15. Big Data MapReduce and Hadoop Pydoop The MapReduce Model Hadoop Wordcount: Pseudocode map(String key, String value): // key: does not matter in this case // value: a subset of input words for each word w in value: Emit(w, "1"); reduce(String key, Iterator values): // key: a word // values: all values associated to key int wordcount = 0; for each v in values: wordcount += ParseInt(v); Emit(key, AsString(wordcount)); Simone Leo Big Data & Hadoop
  • 16. Big Data MapReduce and Hadoop Pydoop The MapReduce Model Hadoop Mock Implementation – mockmr.py from itertools import groupby from operator import itemgetter def _pick_last(it): for t in it: yield t[-1] def mapreduce(data, mapf, redf): buf = [] for line in data.splitlines(): for ik, iv in mapf("foo", line): buf.append((ik, iv)) buf.sort() for ik, values in groupby(buf, itemgetter(0)): for ok, ov in redf(ik, _pick_last(values)): print ok, ov Simone Leo Big Data & Hadoop
  • 17. Big Data MapReduce and Hadoop Pydoop The MapReduce Model Hadoop Mock Implementation – mockwc.py from mockmr import mapreduce DATA = """the quick brown fox ate the lazy green fox """ def map_(k, v): for w in v.split(): yield w, 1 def reduce_(k, values): yield k, sum(v for v in values) if __name__ == "__main__": mapreduce(DATA, map_, reduce_) Simone Leo Big Data & Hadoop
  • 18. Big Data MapReduce and Hadoop Pydoop The MapReduce Model Hadoop MapReduce: Execution Model Mapper Reducer split 0 split 1 split 4 split 3 split 2 User Program Master Input (DFS) Intermediate files (on local disks) Output (DFS) (1) fork (2) assign map (2) assign reduce (1) fork (1) fork (3) read (4) local write (5) remote read (6) write Mapper Mapper Reducer adapted from Zhao et al., MapReduce: The programming model and practice, 2009 – http://research.google.com/pubs/pub36249.html Simone Leo Big Data & Hadoop
  • 19. Big Data MapReduce and Hadoop Pydoop The MapReduce Model Hadoop MapReduce vs Alternatives – 1 Programmingmodel DeclarativeProcedural Flat raw files Data organization Structured MPI DBMS/SQL MapReduce adapted from Zhao et al., MapReduce: The programming model and practice, 2009 – http://research.google.com/pubs/pub36249.html Simone Leo Big Data & Hadoop
  • 20. Big Data MapReduce and Hadoop Pydoop The MapReduce Model Hadoop MapReduce vs Alternatives – 2 MPI MapReduce DBMS/SQL programming model message passing map/reduce declarative data organization no assumption files split into blocks organized structures data type any (k,v) string/Avro tables with rich types execution model independent nodes map/shuffle/reduce transaction communication high low high granularity fine coarse fine usability steep learning curve simple concept runtime: hard to debug key selling point run any application huge datasets interactive querying There is no one-size-fits-all solution Choose according to your problem’s characteristics adapted from Zhao et al., MapReduce: The programming model and practice, 2009 – http://research.google.com/pubs/pub36249.html Simone Leo Big Data & Hadoop
  • 21. Big Data MapReduce and Hadoop Pydoop The MapReduce Model Hadoop Outline 1 Big Data 2 MapReduce and Hadoop The MapReduce Model Hadoop 3 Pydoop Motivation Architecture Usage Simone Leo Big Data & Hadoop
  • 22. Big Data MapReduce and Hadoop Pydoop The MapReduce Model Hadoop Hadoop: Overview Disclaimer First we describe Hadoop 1 Then we highlight what’s different in Hadoop 2 Scalable Thousands of nodes Petabytes of data over 10M files Single file: Gigabytes to Terabytes Economical Open source COTS Hardware (but master nodes should be reliable) Well-suited to bag-of-tasks applications (many bio apps) Files are split into blocks and distributed across nodes High-throughput access to huge datasets WORM storage model Simone Leo Big Data & Hadoop
  • 23. Big Data MapReduce and Hadoop Pydoop The MapReduce Model Hadoop Hadoop: Architecture MR MasterHDFS Master Client Namenode Job Tracker Task Tracker Task Tracker Datanode Datanode Datanode Task Tracker Job Request HDFS Slaves MR Slaves Client sends Job request to JT JT queries NN about data block locations Input stream is split among n map tasks Map tasks are scheduled closest to where data reside Simone Leo Big Data & Hadoop
  • 24. Big Data MapReduce and Hadoop Pydoop The MapReduce Model Hadoop Hadoop Distributed File System (HDFS) file name "a" block IDs, datanodes a3 a2 read data Client Namenode Secondary Namenode a1 Datanode 1 Datanode 3 Datanode 2 Datanode 6 Datanode 5 Datanode 4 Rack 1 Rack 2 Rack 3 a3 a1 a3 a2 a1 a2 namespace checkpointing helps namenode with logs not a namenode replacement Large blocks (∼100 MB) Each block is replicated n times (3 by default) One replica on the same rack, the others on different racks You have to provide network topology Simone Leo Big Data & Hadoop
  • 25. Big Data MapReduce and Hadoop Pydoop The MapReduce Model Hadoop Wordcount: (part of) Java Code public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> { private final static IntWritable one = new IntWritable(1); private Text word = new Text(); public void map(Object key, Text value, Context context) throws IOException, InterruptedException { StringTokenizer itr = new StringTokenizer(value.toString()); while (itr.hasMoreTokens()) { word.set(itr.nextToken()); context.write(word, one); } }} public static class IntSumReducer extends Reducer<Text,IntWritable,Text,IntWritable> { private IntWritable result = new IntWritable(); public void reduce( Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException { int sum = 0; for (IntWritable val: values) sum += val.get(); result.set(sum); context.write(key, result); }} Simone Leo Big Data & Hadoop
• 26. Other MapReduce Components
Combiner (local reducer; see the sketch after this slide)
InputFormat / RecordReader
Translates the byte-oriented view of input files into the record-oriented view required by the Mapper
Directly accesses HDFS files
Processing unit: input split
Partitioner
Decides which Reducer receives which key
Typically uses a hash function of the key
OutputFormat / RecordWriter
Writes key/value pairs output by the Reducer
Directly accesses HDFS files
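The combiner runs the reduce function on each map node's local output before the shuffle, cutting network traffic. A minimal illustration of the effect (plain Python, illustrative only):

    from collections import Counter

    # Raw map output for one map task: four pairs would cross the network.
    map_output = [("rose", 1), ("a", 1), ("rose", 1), ("a", 1)]

    # A combiner pre-aggregates locally, using the same logic as the reducer:
    combined = Counter()
    for k, v in map_output:
        combined[k] += v
    print sorted(combined.items())  # [('a', 2), ('rose', 2)]: half the pairs shuffled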
• 27. Input Formats and HDFS
[Diagram: HDFS blocks, the InputFormat, and the job's input splits]
Input split = HDFS block? The two are distinct concepts:
An HDFS block is an actual data chunk in an HDFS file
An input split is a logical representation of a subset of the MapReduce input dataset
• 28. Hadoop 2
HDFS federation
Multiple namenodes can share the burden of maintaining the namespace(s) and managing blocks
Better scalability
HDFS high availability (HA)
Can run redundant namenodes in active/passive mode
Allows failover in case of crash or planned maintenance
YARN
Resource Manager: global assignment of compute resources to applications (HA supported since 2.4.0)
Application Master(s): per-application scheduling and coordination (like a per-job JT, running on a slave)
MapReduce is just one of many possible frameworks
• 29. Outline
1 Big Data
2 MapReduce and Hadoop
The MapReduce Model
Hadoop
3 Pydoop
Motivation
Architecture
Usage
• 30. Outline
1 Big Data
2 MapReduce and Hadoop
The MapReduce Model
Hadoop
3 Pydoop
Motivation
Architecture
Usage
• 31. Developing your First MapReduce Application
The easiest path for beginners is Hadoop Streaming
Java package included in Hadoop
Use any executable as the mapper or reducer
Read key-value pairs from standard input
Write them to standard output
Text protocol: records are serialized as <k>\t<v>\n

Word count mapper (Python):

#!/usr/bin/env python
import sys

for line in sys.stdin:
    for word in line.split():
        print "%s\t1" % word
• 32. Hadoop Streaming – Word Count Reducer (Python)

#!/usr/bin/env python
import sys

def serialize(key, value):
    return "%s\t%d" % (key, value)

def deserialize(line):
    key, value = line.split("\t", 1)
    return key, int(value)

def main():
    # Streaming delivers reducer input sorted by key, so a key
    # change marks the end of the previous group.
    prev_key, out_value = None, 0
    for line in sys.stdin:
        key, value = deserialize(line)
        if key != prev_key:
            if prev_key is not None:
                print serialize(prev_key, out_value)
            out_value = 0
            prev_key = key
        out_value += value
    if prev_key is not None:  # flush the last group (guards empty input)
        print serialize(prev_key, out_value)

if __name__ == "__main__":
    main()
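Because the scripts only speak stdin/stdout, the shuffle contract they rely on can be replayed without a cluster. A minimal sketch: sorting map output by key is what Hadoop does between the two phases, and it is exactly what the reducer's boundary check depends on.

    # Simulate the sort/shuffle step that Hadoop performs between map
    # and reduce; equal keys become adjacent in the reducer's input.
    map_out = ["rose\t1", "a\t1", "rose\t1", "a\t1"]
    for line in sorted(map_out):
        print line

    # Equivalent local dry run from a shell:
    #   cat input.txt | ./mapper.py | sort | ./reducer.py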
• 33. MapReduce Development with Hadoop
Java: native
C++: MapReduce API (via Hadoop Pipes) + libhdfs (included in the Hadoop distribution)
Python: several solutions, but do they meet all of the requirements of nontrivial apps?
Reuse existing modules, including extensions
NumPy / SciPy for numerical computation
Specialized components (RecordReader/Writer, Partitioner)
HDFS access
. . .
• 34. Python MR: Hadoop-Integrated Solutions
Hadoop Streaming
Awkward programming style
Can only write mapper and reducer scripts (no RecordReader, etc.)
No HDFS access
Jython
Incomplete standard library
Most third-party packages are only compatible with CPython
Cannot use C/C++ extension modules
Typically one or more releases behind CPython
• 35. Python MR: Third-Party Solutions
Most third-party packages are Hadoop Streaming wrappers
They add usability, but share many of Streaming's limitations
Most are not actively developed anymore
Our solution: Pydoop (https://github.com/crs4/pydoop)
Actively developed since 2009
(Almost) pure Python Pipes client + libhdfs wrapper
• 36. Pydoop: Overview
Access to most MR components, including RecordReader, RecordWriter and Partitioner
Get configuration, set counters and report status
Event-driven programming model (like the Java API, but with the simplicity of Python)
You define classes; the MapReduce framework instantiates them and calls their methods
CPython → use any module
HDFS API
• 37. Summary of Features

                      Streaming   Jython    Pydoop
C/C++ extensions      Yes         No        Yes
Standard library      Full        Partial   Full
MR API                No*         Full      Partial
Event-driven prog.    No          Yes       Yes
HDFS access           No          Yes       Yes

(*) you can only write the map and reduce parts as executable scripts.
• 38. Outline
1 Big Data
2 MapReduce and Hadoop
The MapReduce Model
Hadoop
3 Pydoop
Motivation
Architecture
Usage
• 39. Hadoop Pipes
[Diagram: the Java Hadoop framework (PipesMapRunner, PipesReducer) talks to a C++ application running in a child process; the child hosts the record reader, mapper, partitioner, combiner, reducer and record writer, connected via upward and downward protocols]
Application: separate process
Communication with the Java framework via persistent sockets
The C++ app provides a factory used by the framework to create MR components
• 40. Integration of Pydoop with Hadoop
[Diagram: layered architecture; pure Python modules on top, extension modules (core ext, hdfs ext, (de)serialization) below them, communicating with Hadoop through Java Pipes and libhdfs (a JNI wrapper over Java HDFS)]
• 41. Outline
1 Big Data
2 MapReduce and Hadoop
The MapReduce Model
Hadoop
3 Pydoop
Motivation
Architecture
Usage
• 42. Python Wordcount, Full Program Code

import pydoop.mapreduce.api as api
import pydoop.mapreduce.pipes as pp

class Mapper(api.Mapper):

    def map(self, context):
        words = context.value.split()
        for w in words:
            context.emit(w, 1)

class Reducer(api.Reducer):

    def reduce(self, context):
        s = sum(context.values)
        context.emit(context.key, s)

def __main__():
    pp.run_task(pp.Factory(Mapper, Reducer))
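Note the __main__ function: it is the entry point Pydoop's submitter expects to find. Assuming the program is saved as wc.py, a launch typically looks like the following (exact flags depend on the Pydoop version; input and output are placeholder HDFS paths):

    pydoop submit --upload-file-to-cache wc.py wc input output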
• 43. Status Reports and Counters

import pydoop.mapreduce.api as api

class Mapper(api.Mapper):

    def __init__(self, context):
        super(Mapper, self).__init__(context)
        context.set_status("initializing mapper")
        self.input_words = context.get_counter("WC", "INPUT_WORDS")

    def map(self, context):
        words = context.value.split()
        for w in words:
            context.emit(w, 1)
        context.increment_counter(self.input_words, len(words))
• 44. Status Reports and Counters: Web UI
[Screenshot: the job's counters as displayed in the Hadoop web UI]
• 45. Optional Components: Record Reader

from pydoop.utils.serialize import serialize_to_string as to_s
import pydoop.hdfs as hdfs

class Reader(api.RecordReader):

    def __init__(self, context):
        super(Reader, self).__init__(context)
        self.isplit = context.input_split
        self.file = hdfs.open(self.isplit.filename)
        self.file.seek(self.isplit.offset)
        self.bytes_read = 0
        if self.isplit.offset > 0:
            # skip the partial record belonging to the previous split
            discarded = self.file.readline()
            self.bytes_read += len(discarded)

    def next(self):
        if self.bytes_read > self.isplit.length:
            raise StopIteration
        key = to_s(self.isplit.offset + self.bytes_read)
        record = self.file.readline()
        if record == "":  # end of file
            raise StopIteration
        self.bytes_read += len(record)
        return key, record
• 46. Optional Components: Record Writer, Partitioner

import sys
import pydoop.hdfs as hdfs

class Writer(api.RecordWriter):

    def __init__(self, context):
        super(Writer, self).__init__(context)
        jc = context.job_conf
        part = jc.get_int("mapred.task.partition")
        out_dir = jc["mapred.work.output.dir"]
        outfn = "%s/part-%05d" % (out_dir, part)
        hdfs_user = jc.get("pydoop.hdfs.user", None)
        self.file = hdfs.open(outfn, "w", user=hdfs_user)
        self.sep = jc.get("mapred.textoutputformat.separator", "\t")

    def close(self):
        self.file.close()
        self.file.fs.close()

    def emit(self, key, value):
        self.file.write("%s%s%s\n" % (key, self.sep, value))

class Partitioner(api.Partitioner):

    def partition(self, key, n_red):
        # non-negative hash of the key, modulo the number of reducers
        reducer_id = (hash(key) & sys.maxint) % n_red
        return reducer_id
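Optional components are plugged into the same factory used for the plain word count. A sketch, assuming these keyword argument names for the pipes Factory (check the signature shipped with your Pydoop version):

    import pydoop.mapreduce.pipes as pp

    def __main__():
        # Assumed kwarg names; verify against your Pydoop release.
        factory = pp.Factory(
            Mapper, Reducer,
            record_reader_class=Reader,
            record_writer_class=Writer,
            partitioner_class=Partitioner,
        )
        pp.run_task(factory)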
• 47. The HDFS Module

>>> import pydoop.hdfs as hdfs
>>> hdfs.mkdir("test")
>>> hdfs.dump("hello", "test/hello.txt")
>>> text = hdfs.load("test/hello.txt")
>>> print text
hello
>>> hdfs.ls("test")
['hdfs://localhost:9000/user/simleo/test/hello.txt']
>>> hdfs.stat("test/hello.txt").st_size
5L
>>> hdfs.path.isdir("test")
True
>>> hdfs.cp("test", "test.copy")
>>> with hdfs.open("test/hello.txt", "r") as fi:
...     print fi.read(3)
...
hel
• 48. HDFS Usage by Block Size

import pydoop.hdfs as hdfs

ROOT = "pydoop_test_tree"

def usage_by_bs(fs, root):
    # total bytes stored under root, grouped by HDFS block size
    stats = {}
    for info in fs.walk(root):
        if info["kind"] == "directory":
            continue
        bs = int(info["block_size"])
        size = int(info["size"])
        stats[bs] = stats.get(bs, 0) + size
    return stats

if __name__ == "__main__":
    with hdfs.hdfs() as fs:
        print "BS (MB)\tBYTES USED"
        for k, v in sorted(usage_by_bs(fs, ROOT).iteritems()):
            print "%d\t%d" % (k / 2**20, v)
• 49. Thank you for your time!