Hadoop: The Default Machine Learning Platform ?

Hadoop:The Default
Machine Learning
Platform ?
Milind Bhandarkar
Chief Scientist, Pivotal
@techmilind
Wednesday, December 18, 2013

About Me
• http://www.linkedin.com/in/milindb
• Founding member of Hadoop team atYahoo! [2005-2010]
• Contributor to Apache Hadoop since v0.1
• Built and led Grid SolutionsTeam atYahoo! [2007-2010]
• Parallel Programming Paradigms [1989-today] (PhD cs.illinois.edu)
• Center for Development of Advanced Computing (C-DAC), National Center
for Supercomputing Applications (NCSA), Center for Simulation of Advanced
Rockets, Siebel Systems (acquired by Oracle), Pathscale Inc. (acquired by
QLogic),Yahoo!, LinkedIn, and Pivotal (formerly Greenplum)

Acknowledgements
•Developers of various Open-Source, and
Proprietary Data Platforms
•Ex-Colleagues atYahoo! Research
•Colleagues at Data ScienceTeam, Pivotal
•Vijay Narayanan, Microsoft

Kryptonite: First Hadoop Cluster AtYahoo!

M45:Academic Collaboration

Analytics Workbench

!"#$%&'())'
BATCH
HDFS
!"#$%&'())'
INTERACTIVE
!"#$%&'())'
BATCH
HDFS
!"#$%&'())'
BATCH
HDFS
!"#$%&'())'
ONLINE
Hadoop 1.0
(Image Courtesy Arun Murthy, Hortonworks)

MapReduce 1.0

ML in MapReduce:Why ?

• High data throughput: 100TB/hr using 500 mappers

• Framework provides fault tolerance
• Monitors and and re-starts tasks on other machines should
one of the machines fail

• Excels in counting patterns over data records

• Built on relatively cheap, commodity hardware

• Built on relatively cheap, commodity hardware
• Large volumes of data already stored on Hadoop clusters
running MapReduce

• Learning can become limited by computation
time and not data volume
• With large enough data and number of machines
• Reduces the need to down-sample data
• More accurate parameter estimates compared to
learning on a single machine for the same amount of
time

Learning Models in
MapReduce
• Data parallel algorithms are most appropriate for
MapReduce implementations
• Not necessarily the most optimal implementation for a
speciﬁc algorithm
• Other specialized non-MapReduce implementations
exist for some algorithms, which may be better
• MR may not be the appropriate framework for exact
solutions of non data parallel/sequential algorithms
• Approximate solutions using MR may be good enough

Types of Learning in MapReduce
•Parallel training of multiple models
•Train either in Mappers or Reducers
•Ensemble training methods
•Train multiple models and combine them
•Distributed learning algorithms
•Learn using both Mappers and Reducers

ParallelTraining of
Multiple Models
• Train multiple models simultaneously using a learning algorithm that can
be learnt in memory
• Useful when individual models are trained using a subset, ﬁltered or
modiﬁcation of raw data
• Train 1 model in each reducer
• Map:
• Input:All data
• Filters subset of data relevant for each model training
• Output: <model_index, subset of data for training this model>
• Reduce
• Train model on data corresponding to that model_index

Distributed Learning
Algorithms
• Suitable for learning algorithms that are
• Compute-Intensive per data record
• One or few iterations for learning
• Do not transfer much data between iterations
• Typical algorithms
• Fit the Statistical query model (SQM)
• Divide and conquer

k-Means Clustering
• Choose k samples as initial cluster centroids
• In each MapReduce Iteration:
• Assign membership of each point to closest cluster
• Re-compute new cluster centroids using assigned members

• Control program to
• Initialize the centroids
• random, initial clustering on sample etc.
• Run the MapReduce iterations
• Determine stopping criterion

k-Means Clustering in MapReduce

• Map
• Input data points: x_i
• Input cluster centroids: c_i
• Assign each data point to closest cluster
• Output

• Map
• Input data points: x_i
• Input cluster centroids: c_i
• Assign each data point to closest cluster
• Output
• Reduce
• Compute new centroids for each cluster

Complexity of k-Means
Clustering
• Each point is compared with each cluster
centroid
• Complexity = N*K*O(d) where O(d) is the
complexity of the distance metric
• Typical Euclidean distance is not a cheap
operation
• Can reduce complexity using an initial canopy
clustering to partition data cheaply

Apache Mahout
• Goal
• Create scalable, machine learning algorithms under the
Apache license
• Contains both:
• Hadoop implementations of algorithms that scale linearly
with data.
• Fast sequential (non MapReduce) algorithms
• Wiki:
• https://cwiki.apache.org/conﬂuence/display/MAHOUT/
Mahout+Wiki

Algorithms in Mahout
• Classiﬁcation:
• Logistic Regression, Naïve Bayes, Complementary Naïve
Bayes, Random Forests
• Clustering
• K-means, Fuzzy k-means, Canopy, Mean-shift clustering,
Dirichlet Process clustering, Latent Dirichlet allocation,
Spectral clustering
• Parallel FP growth
• Item based recommendations
• Stochastic Gradient Descent (sequential)

Challenges for ML with
MapReduce
•MapReduce is optimized for large batch data
processing
•Assumes data parallelism
•Ideal for shared-nothing computing
•Many learning algorithms are iterative
•Incur signiﬁcant overheads per iteration

Challenges (contd.)
•Multiple scans of the same data
•Typically once per iteration: High I/O
overhead reading data into mappers per
iteration
•In some algorithms static data is read into
mappers in each iteration
•e.g. input data in k-means clustering.

Challenges (contd.)
•Need a separate controller outside the
framework to:
•Coordinate the multiple MapReduce jobs
for each iteration
•Perform some computations between
iterations and at the end
•Measure and implement stopping criterion

Challenges (contd.)
• Incur multiple task initialization overheads
• Setup and tear down mapper and reducer
tasks per iteration
• Transfer/shufﬂe static data between mapper
and reducer repeatedly
• Intermediate data is transferred through
index/data ﬁles on local disks of mappers
and pulled by reducers

Challenges (contd.)
•Blocking architecture
•Reducers cannot start till all map jobs
complete
•Availability of nodes in a shared environment
•Wait for mapper and reducer nodes to
become available in each iteration in a
shared computing cluster

Iterative Algorithms in MapReduce
PassResult

Iterative Algorithms in MapReduce
Overhead per Iteration:
• Job setup
• Data Loading
• Disk I/O
PassResult

Enhancements to
MapReduce
• Many proposals to overcome these challenges
• All try to retain the core strengths of data
partitioning and fault tolerance of Hadoop to
various degrees
• Proposed enhancements and alternatives to
Hadoop
• Worker/Aggregator framework, HaLoop,
MapReduce Online, iMapReduce, Spark,Twister,
Hadoop ML, ...

Worker/Aggregator
12/17/13102
FinalResultWednesday, December 18, 2013

Worker/Aggregator
12/17/13102
Advantages:
• Schedule once per Job
• Data stays in memory
• P2P communication
FinalResultWednesday, December 18, 2013

HaLoop
• Programming model and architecture for iterations
• New APIs to express iterations in the framework
• Loop-aware task scheduling
• Physically co-locate tasks that use the same data
in different iterations
• Remember association between data and node
• Assign task to node that uses data cached in that
node

HaLoop (contd.)
• Caching for loop invariant data:
• Detect invariants in first iteration, cache on local
disk to reduce I/O and shuffling cost in subsequent
iterations
• Cache for Mapper inputs, Reducer Inputs, Reducer
outputs
• Caching to support fixpoint evaluation:
• Avoids the need for a dedicated MR step on each
iteration

Spark
• Open Source Cluster Computing model:
• Different from MapReduce, but retains some basic
character
• Optimized for:
• Iterative computations
• Applies to many learning algorithms
• Interactive data mining
• Load data once into multiple mappers and run
multiple queries

Spark (contd.)
• Programming model using working sets
• applications reuse intermediate results in multiple parallel
operations
• preserves the fault tolerance of MapReduce
• Supports
• Parallel loops over distributed datasets
• Loads data into memory for (re)use in multiple iterations
• Access to shared variables accessible from multiple machines
• Implemented in Scala,
• www.spark-project.org

Hadoop 2.0
HADOOP 1.0
!"#$%
!"#$%&$'&()*"#+,'-+#*.(/"'0#1*
&'()*+,-*%
!2+%.(#"*"#./%"2#*3'&'0#3#&(*
*4*$'('*5"/2#..,&01*
!"#$.%
!"#$%&$'&()*"#+,'-+#*.(/"'0#1*
/0)1%
!2+%.(#"*"#./%"2#*3'&'0#3#&(1*
2*3%
!#6#2%7/&*#&0,&#1*
HADOOP 2.0
456%
!$'('*8/91*
!57*%
!.:+1*
%
89:*;<%
!2'.2'$,&01*
*
456%
!$'('*8/91*
!57*%
!.:+1*
%
89:*;<%
!2'.2'$,&01*
%
&)%
!-'(2;1*
)2%%
$9;*'=>%
?;'(:%
!"#$%&''
()$*+,'
*
$*;75-*<%
-.*/0'
*

!""#$%&'()*+,-)+.&'/0#1+2.+3&4(("+
35678+!"#$%&$'&()*"#+,'-+#*.(/0'1#2*
9!,.+!3+%4(#0*"#4/%05#*6'&'1#7#&(2***
:!;<3+
=>&",04-%0?+
2.;@,!<;2A@+
=;0B?+
7;,@!>2.C+
=7D(EFG+7HGI?+
C,!J3+
=C$E&"K?+
2.L>@>M,9+
=7"&EN?+
3J<+>J2+
=M"0)>J2?+
M.O2.@+
=3:&*0?+
M;3@,+
=70&E%K?+
=P0&/0I?+
YARN Platform

!"#$%&'&($)* !"#$%&'&($)* !"#$%&'&($)* !"#$%&'&($)*
+"',&-'$)*./.*
+"',&-'$)*0/1*
!"#$%&'&($)* !"#$%&'&($)* !"#$%&'&($)* !"#$%&'&($)*
!"#$%&'&($)* !"#$%&'&($)* !"#$%&'&($)* !"#$%&'&($)*
+"',&-'$)*./0*
+"',&-'$)*./2*
3%*.*
+"',&-'$)*0/0*
+"',&-'$)*0/.*
+"',&-'$)*0/2*
3%0*
+4-$',0*
5$6"7)8$%&'&($)*
98:$#74$)*
YARN Architecture

YARN
•Yet Another Resource Negotiator
•Resource Manager
•Node Managers
•Application Masters
•Speciﬁc to paradigm, e.g. MR Application
master (aka JobTracker)

MPP SQL On Hadoop

SQL-on-Hadoop
•Pivotal HAWQ
•Cloudera Impala, Facebook Presto,Apache
Drill, Cascading Lingual, Optiq, Hortonworks
Stinger
•Hadapt, Jethrodata, IBM BigSQL, Microsoft
PolyBase
•More to come...

Network
Interconnect
...
......HAWQ & HDFS
Master
Severs
Planning & dispatch
Segment
Severs
Query execution
...
Storage
HDFS, HBase …

Namenode
B
replication
Rack1 Rack2
DatanodeDatanode Datanode
Read/Write
Segment
Segment host
Segment
Segment
Segment host
Segment
Segment host
Master host
Meta Ops
HAWQ Interconnect
Segment
Segment
Segment
Segment host
Segment
Datanode
Segment
SegmentSegment Segment

HAWQ vs Hive
Lower is Better

Provides data-parallel implementations
of mathematical, statistical and machine-learning
methods
for structured and unstructured data.
In-Database Analytics

MADlib Algorithms

MADLib Functions
• Linear Regression
• Logistic Regression
• Multinomial Logistic
Regression
• K-Means
• Association Rules
• Latent Dirichlet Allocation
• Naïve Bayes
• Elastic Net Regression
• DecisionTrees / Random
Forest
• SupportVector Machines
• Cox Proportional Hazards
Regression
• Descriptive Statistics
• ARIMA

k-Means Usage
SELECT * FROM madlib.kmeanspp (
‘customers’, -- name of the input table
‘features’, -- name of the feature array column
2 -- k : number of clusters
);
centroids | objective_fn | frac_reassigned | …
------------------------------------------------------------------------+------------------+-----------------+ …
{{68.01668579784,48.9667382972952},{28.1452167573446,84.5992507653263}} | 586729.010675982 | 0.001 | …

Accessing HAWQ
Through R

Pivotal R
•Interface is R client
•Execution is in database
•Parallelism handled by PivotalR
•Supports a portion of R
R> x = db.data.frame(“t1”)
R> l = madlib.lm(interlocks ~ assets + nation, data = t)

A wrapper of MADlib
• Linear regression
• Logistic regression
• Elastic Net
• ARIMA
• Table summary

A wrapper of MADlib
• Elastic Net
• ARIMA
• Table summary
• $ [ [[ $<- [<- [[<-
• is.na
• + - * / %% %/%
^
• & | !
• == != > < >= <=
• merge
• by
• db.data.frame
• as.db.data.frame
• preview• sort
• c mean sum sd var min max
length colMeans colSums
• db.connect db.disconnect db.list
db.objects
db.existsObject delete
• dim names
• content
And more ... (SQL wrapper)
• predict

A wrapper of MADlib
• Elastic Net
• ARIMA
• Table summary
• Categorial variable
as.factor()
• $ [ [[ $<- [<- [[<-
• is.na
• + - * / %% %/%
^
• & | !
• == != > < >= <=
• merge
• by
• db.data.frame
• as.db.data.frame
• preview• sort
• c mean sum sd var min max
length colMeans colSums
• db.connect db.disconnect db.list
db.objects
db.existsObject delete
• dim names
• content
And more ... (SQL wrapper)
• predict

Pivotal Confidential–Internal Use Only 49
db.obj
db.data.frame db.Rquery
db.table db.view

db.obj
db.table db.view
Wrapper of objects in database
x = db.data.frame("table")

db.obj
db.table db.view
Resides in R only
x[,1:2],
merge(x, y, by="column"),
etc.

db.obj
db.table db.view
Most operations
Resides in R only
x[,1:2],
etc.

db.obj
db.table db.view
as.db.data.frame(...)
Most operations
Resides in R only
x[,1:2],
etc.

db.obj
db.table db.view
Most operations
Resides in R only
x[,1:2],
etc.
MADlib
wrapper
functions

db.obj
db.table db.view
Most operations
Resides in R only
x[,1:2],
etc.
MADlib
wrapper
functions
preview

In-Database Execution
•All data stays in DB: R objects merely point
to DB objects
•All model estimation and heavy lifting done
in DB by MADlib
•R→ SQL translation done in the R client
•Only strings of SQL and model output
transferred across ODBC/DBI

Beyond MapReduce
•Apache Giraph - BSP & Graph Processing
•Storm onYarn - Streaming Computation
•HOYA - HBase onYarn
•Hamster - MPI on Hadoop
•More to come ...

Hamster
• Hadoop and MPI on the same
cluster
• OpenMPI Runtime on Hadoop
YARN
• Hadoop Provides: Resource
Scheduling, Process monitoring,
Distributed File System
• Open MPI Provides: Process
launching, Communication, I/O
forwarding

Hamster Components
•Hamster Application Master
•Gang Scheduler,YARN Application
Preemption
•Resource Isolation (lxc Containers)
•ORTE: Hamster Runtime
•Process launching,Wireup, Interconnect

Resource Manager
Scheduler
AMService
Node Manager Node Manager Node Manager
!
Proc/
Container
Framework
Daemon
NS
MPI
Scheduler
HNP
MPI AM
Proc/
Container
!RM-AM
AM-NM
RM-NodeManagerClient
Client-RM
Aux Srvcs
Proc/
Container
Framework
Daemon
NS
Proc/
Container
!
Aux Srvcs
RM-
NodeManager
Hamster Architecture

Hamster Scalability
•Sufﬁcient for small to medium HPC
workloads
•Job launch time gated byYARN resource
scheduler
Launch WireUp Collectives Monitor
OpenMPI O(logN) O(logN) O(logN) O(logN)
Hamster O(N) O(logN) O(logN) O(logN)

GraphLab + Hamster
on Hadoop
!

About GraphLab
•Graph-based, High-Performance distributed
computation framework
•Started by Prof. Carlos Guestrin in CMU in
2009
•Recently founded Graphlab Inc to
commercialize Graphlab.org

GraphLab Features
•Topic Modeling (e.g. LDA)
•Graph Analytics (Pagerank,Triangle counting)
•Clustering (K-Means)
•Collaborative Filtering
•Linear Solvers
•etc...

Only Graphs are not
Enough
• Full Data processing workﬂow required ETL/
Postprocessing,Visualization, Data Wrangling, Serving
• MapReduce excels at data wrangling
• OLTP/NoSQL Row-Based stores excel at Serving
• GraphLab should co-exist with other Hadoop
frameworks

Questions?

Hadoop: The Default Machine Learning Platform ?

Empfohlen

Empfohlen

Weitere ähnliche Inhalte

Was ist angesagt?

Was ist angesagt? (20)

Andere mochten auch

Andere mochten auch (10)

Ähnlich wie Hadoop: The Default Machine Learning Platform ?

Ähnlich wie Hadoop: The Default Machine Learning Platform ? (20)

Kürzlich hochgeladen

Kürzlich hochgeladen (20)

Hadoop: The Default Machine Learning Platform ?