The Apache Hadoop project and the Hadoop ecosystem have been designed to be extremely flexible and extensible. HDFS, YARN, and MapReduce combined have more than 1,000 configuration parameters that allow users to tune the performance of Hadoop applications and, more importantly, to extend Hadoop with application-specific functionality without having to modify any of the core Hadoop code.
In this talk, I will start with simple extensions, such as writing a new InputFormat to efficiently process video files. I will then describe some extensions that boost application performance, such as optimized compression codecs and pluggable shuffle implementations. With the refactoring of the MapReduce framework and the emergence of YARN as a generic resource manager for Hadoop, one can extend Hadoop further by implementing new computation paradigms.
I will discuss one such computation framework, which allows Message Passing applications to run in a Hadoop cluster alongside MapReduce. I will conclude by outlining some of our ongoing work that extends HDFS by removing the namespace limitations of the current Namenode implementation.
Extending Hadoop for Fun & Profit
1. Extending Hadoop for
Fun & Profit
Milind Bhandarkar
Chief Scientist, Pivotal Software
(Twitter: @techmilind)
2. About Me
• http://www.linkedin.com/in/milindb
• Founding member of Hadoop team at Yahoo! [2005-2010]
• Contributor to Apache Hadoop since v0.1
• Built and led Grid Solutions Team at Yahoo! [2007-2010]
• Parallel Programming Paradigms [1989-today] (PhD cs.illinois.edu)
• Center for Development of Advanced Computing (C-DAC),
National Center for Supercomputing Applications (NCSA), Center
for Simulation of Advanced Rockets, Siebel Systems (acquired by
Oracle), Pathscale Inc. (acquired by QLogic), Yahoo!, LinkedIn, and
Pivotal (formerly Greenplum)
12. Extending Input Phase
• Convert ByteStream to List(Key,Value)
• Several Formats pre-packaged
• TextInputFormat<LongWritable, Text>
• SequenceFileInputFormat<K,V>
• KeyValueTextInputFormat<Text,Text>
• Specify InputFormat for each job
• JobConf.setInputFormat()
13. InputFormat
• getSplits(): From input descriptors, get Input Splits, such that each Split can be processed independently
• <FileName, startOffset, length>
• getRecordReader(): From an InputSplit, get list of Records
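The split arithmetic behind getSplits() can be modeled without the Hadoop API. The sketch below is illustrative only (SplitModel and its FileSplit are stand-ins, not the Hadoop classes): one split per HDFS block, described as <FileName, startOffset, length>.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitModel {

    // Stand-in for Hadoop's FileSplit: <FileName, startOffset, length>
    public static class FileSplit {
        public final String fileName;
        public final long startOffset;
        public final long length;

        public FileSplit(String fileName, long startOffset, long length) {
            this.fileName = fileName;
            this.startOffset = startOffset;
            this.length = length;
        }
    }

    // One split per HDFS block; the last split may be shorter than blockSize.
    public static List<FileSplit> getSplits(String fileName, long fileLength, long blockSize) {
        List<FileSplit> splits = new ArrayList<FileSplit>();
        for (long offset = 0; offset < fileLength; offset += blockSize) {
            splits.add(new FileSplit(fileName, offset, Math.min(blockSize, fileLength - offset)));
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // A 200 MB file with 64 MB blocks yields 4 splits; the last is 8 MB.
        List<FileSplit> splits = getSplits("video.mpg", 200 * mb, 64 * mb);
        System.out.println(splits.size());              // 4
        System.out.println(splits.get(3).length / mb);  // 8
    }
}
```

Because only <FileName, startOffset, length> is needed, split determination stays metadata-based and cheap, which matters since it runs in a single process (see slide 26).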
16. Anomaly Detection in Surveillance Video
• Detect anomalous objects in a restricted perimeter
• A typical large enterprise collects TBs of video per day
• Hadoop MapReduce runs computer vision algorithms in parallel and captures violation events
• Post-incident monitoring enabled by Interactive Query
17. Video Data Flow
• Timestamped Video Files as input
• Distributed Video Transcoding: ETL in Hadoop
• Distributed Video Analytics in Hadoop/HAWQ
• Insights in relational DB
18. Real World Video Data
• Benchmark surveillance videos from UK Home Office (i-LIDS)
• CCTV video footage depicting scenarios central to Govt requirements
19. Common Video Standards
• MPEG & ITU responsible for most video standards
• MPEG-2 (1995): widely adopted in DVDs, TV, set-top boxes
20. MPEG Standard Format
• Sequence of encoded video frames
• Compression by eliminating:
• Redundancy in Time: Inter-Frame Encoding
• Redundancy in Space: Intra-Frame Encoding
21. Motion Compensation
• I-Frame: Intra-Frame encoding
• P-Frame: Predicted frame from previous frame
• B-Frame: Predicted frame from both previous & next frames
22. Distributed MPEG Decoding
• HDFS splits large files into 64 MB/128 MB blocks
• Each HDFS block can be processed independently by a Map task
• Can we decode individual video frames from an arbitrary HDFS block in an MPEG file?
23. Splitting MPEG-2
• Header Information available only once per file
• Group of Pictures (GOP) header repeats
• Each GOP starts with an I-Frame and ends with
an I-Frame
• Each GOP can be decoded independently
• First and last GOP may straddle HDFS blocks
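Finding GOP boundaries reduces to scanning for the MPEG-2 group_of_pictures start code, 0x000001B8. A minimal, self-contained scanner sketch (GopScanner is illustrative, not part of Hadoop or any codec library):

```java
public class GopScanner {

    // MPEG-2 group_of_pictures header start code: 0x00 0x00 0x01 0xB8
    private static final byte[] GOP_START = {0x00, 0x00, 0x01, (byte) 0xB8};

    // Returns the offset of the first GOP header at or after 'from',
    // or -1 if none is found in the buffer.
    public static int findGopHeader(byte[] buf, int from) {
        outer:
        for (int i = from; i <= buf.length - GOP_START.length; i++) {
            for (int j = 0; j < GOP_START.length; j++) {
                if (buf[i + j] != GOP_START[j]) continue outer;
            }
            return i;
        }
        return -1;
    }

    public static void main(String[] args) {
        // A fake block: a few non-matching bytes, then a GOP start code at offset 5.
        byte[] block = {0x47, 0x00, 0x00, 0x00, 0x01, 0x00, 0x00, 0x01, (byte) 0xB8, 0x7F};
        System.out.println(findGopHeader(block, 0)); // prints 5
    }
}
```

A real record reader would additionally read past its block boundary to finish the straddling GOP, since the first and last GOP of a block may begin in a neighboring block.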
25. MPEG2RecordReader
• Start from beginning of block
• Search for the first GOP Header
• Locate an I-Frame, decode, keep in memory
• If P-Frame, decode using last frame
• If B-Frame, keep current frame in memory, read next frame, decode current frame
26. Considerations for Input Format
• Use as little metadata as possible
• Number of Splits = Number of Map Tasks
• Combine small files
• Split determination happens in a single process, so it should be metadata-based
• Affects scalability of MapReduce
27. Scalability
• If one node processes k MB/s, then N nodes should process (k*N) MB/s
• If a fixed amount of data is processed in T minutes on one node, then N nodes should process the same data in (T/N) minutes
• Linear Scalability
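The linear-scalability claim is simple proportional arithmetic; a quick sanity-check sketch (class name and numbers are illustrative):

```java
public class LinearScaling {

    // Aggregate throughput for N nodes, each processing kPerNode MB/s.
    public static double throughputMBps(double kPerNode, int nodes) {
        return kPerNode * nodes;
    }

    // Time for a fixed workload that takes tOneNode minutes on one node.
    public static double minutes(double tOneNode, int nodes) {
        return tOneNode / nodes;
    }

    public static void main(String[] args) {
        System.out.println(throughputMBps(50, 8)); // 400.0 MB/s
        System.out.println(minutes(120, 8));       // 15.0 minutes
    }
}
```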
36. Why Shuffle?
• Often the most expensive phase in MapReduce; involves slow disks and network
• Map tasks partition, sort, and serialize outputs, and write to local disk
• Reduce tasks pull individual Map outputs over network, merge, and may spill to disk
38. Message Granularity
• For Gigabit Ethernet:
• α = 300 μs
• β = 100 MB/s
• 100 messages of 10 KB each = 40 ms
• 10 messages of 100 KB each = 13 ms
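Both timings follow from the alpha-beta cost model, time = n × (α + m/β). Taking β = 100 MB/s as 100 bytes per microsecond, the arithmetic can be checked in exact integer microseconds (AlphaBeta is an illustrative sketch, not a library class):

```java
public class AlphaBeta {

    static final long ALPHA_MICROS = 300;     // per-message latency, in microseconds
    static final long BYTES_PER_MICRO = 100;  // beta = 100 MB/s = 100 bytes/us

    // Total transfer time in microseconds for n messages of mBytes each.
    public static long totalMicros(long n, long mBytes) {
        return n * (ALPHA_MICROS + mBytes / BYTES_PER_MICRO);
    }

    public static void main(String[] args) {
        System.out.println(totalMicros(100, 10_000) / 1000); // 40 (ms)
        System.out.println(totalMicros(10, 100_000) / 1000); // 13 (ms)
    }
}
```

Fewer, larger messages amortize the per-message latency α, which is why 10 × 100 KB beats 100 × 10 KB for the same total payload.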
39. Alpha-Beta
• Common Mistake: Assuming that α is constant
• Scheduling latency for responder
• MR daemons' time slices are inversely proportional to the number of concurrent tasks
• Common Mistake: Assuming that β is constant
• Network congestion
• TCP incast
50. YARN
• Yet Another Resource Negotiator
• Resource Manager
• Node Managers
• Application Masters
• Specific to each paradigm, e.g. the MR Application Master (aka JobTracker)
51. Beyond MapReduce
• Apache Giraph - BSP & Graph Processing
• Storm on YARN - Streaming Computation
• HOYA - HBase on YARN
• Hamster - MPI on Hadoop
• More to come ...
52. Hamster
• Hadoop and MPI on the same cluster
• Open MPI runtime on Hadoop YARN
• Hadoop provides: resource scheduling, process monitoring, distributed file system
• Open MPI provides: process launching, communication, I/O forwarding
57. About GraphLab
• Graph-based, high-performance distributed computation framework
• Started by Prof. Carlos Guestrin at CMU in 2009
• Recently founded GraphLab Inc. to commercialize GraphLab.org
59. Graphs Alone Are Not Enough
• A full data processing workflow requires ETL/Post-processing, Visualization, Data Wrangling, Serving
• MapReduce excels at data wrangling
• OLTP/NoSQL row-based stores excel at serving
• GraphLab should co-exist with other Hadoop frameworks
62. HCFS
• Hadoop Compatible File Systems
• FileSystem, FileContext
• S3, Local FS, webhdfs
• Azure Blob Storage, CassandraFS, Ceph, CleverSafe, Google Cloud Storage, Gluster, Lustre, QFS, EMC ViPR (more to come)
63. New Dataset
• Reuse Namenode and Datanode implementations
• Substitute a different DataSet implementation: FsDatasetSpi, FsVolumeSpi
• Jira: HDFS-5194