Open Source Big Data Analytics Toolkits Comparison

Big Data Analytics –
Open Source Toolkits
Prafulla Wani
Snehalata Deorukhkar

Introduction
 Talk Background
– More Data Beats Better Algorithms
– Evaluate “Analytics Toolkits” that support Hadoop
 Speaker Backgrounds
 Data Engineers
 No PhDs in statistics
2

Big Data Analytics Toolkits
 Evaluation parameters
– Ease of use
• Development APIs
• # of Algorithms supported
– Performance
• Scalable Architecture
• Disk-based / Memory-based
 Open-source
only
– RHadoop
– Mahout
– MADLib
– HiveMall
– H2O
– Spark-MLLib
3

Analytics Project lifecycle
Train
Model(s)
Gather
Data
Compare
Accuracy
Predict
Future
 Train Algorithm 1
(Logistic regression)
 Train Algorithm 2
(SVM)
 ......
 Train Algorithm N
4

Analytics (Pre-Hadoop era)
Performance
Ease
of
use
Single
Machine
R,
Octave
R –
 Started in 1993
 Very Popular
 5589 packages
 Written primarily in
C and Fortran
Octave –
 Started in 1988
 Open source and
features
comparable with
Matlab
5

Timeline
2014201320122008
6
2006
Hadoop
20112010
Core Hadoop
(HDFS,
MapReduce)
HBase,
Zookeeper
, Pig,
Hive…
Avro,
Sqoop
Cloudera
Impala
YARN

Architecture
R
R R R R
R R R R
Client/
Edge Node
Hadoop Cluster
Client/
Edge Node
Hadoop Cluster
RHadoop
Mahout
Mahout
Map/
Reduce
Map/
Reduce
7

Timeline
2014201320122008
8
2006
Hadoop
20112010
Core Hadoop
(HDFS,
MapReduce)
HBase,
Zookeeper
, Pig,
Hive…
Avro,
Sqoop
Cloudera
Impala
YARN
RHadoop
rhdfs, rmr rmr 2.0 plyrmr

RHadoop
 Provides R packages –
– rhdfs - to read/write from/to HDFS
– rhbase - to read/write from/to HBase
– rmr - to express map-reduce programs in R
 Does not provide out-of-box packages for
model training
9

RHadoop
logistic.regression = function(input, iterations, dims, alpha){
plane = t(rep(0, dims))
g = function(z) 1/(1 + exp(-z))
for (i in 1:iterations) {
gradient =
values( from.dfs( mapreduce(
input,
map = lr.map,
reduce = lr.reduce,
combine = T)))
plane = plane + alpha * gradient }
plane }
lr.map =
function(., M) {
Y = M[,1]
X = M[,-1]
keyval(
1, +
Y * X *
g(-Y * as.numeric(X %*% t(plane))))}
lr.reduce =
function(k, Z)
keyval(k, t(as.matrix(apply(Z,2,sum))))
10

Timeline
2014201320122008
11
2006
Hadoop
Mahout
Started as a
subproject
of Apache
Lucene
20112010
Decision to reject
new MapReduce
implementation
Future
implementations
on top of Apache
Spark
Integration with
H2O platform
Top level
apache
project
4 releases
(0.1 – 0.4)
Core Hadoop
(HDFS,
MapReduce)
HBase,
Zookeeper
, Pig,
Hive…
Mahout
Avro,
Sqoop
Cloudera
Impala
YARN
0.8
release
Recomme
ndation
Engines –
Common
Case
study for
Hadoop
RHadoop
rhdfs, rmr rmr 2.0 plyrmr

Mahout
 Original goal - To implement all 10 algorithms from Andrew
Ng's paper "Map-Reduce for Machine Learning on
Multicore"
 Java based library having MapReduce implementation of
common analytics algorithms
 Key algorithms
– Recommendation algorithms / Collaborative filtering
– Classification
– Clustering
– Frequent Pattern Growth
12

Mahout
 Train the model:
mahout org.apache.mahout.df.mapreduce.BuildForest -
Dmapred.max.split.size=1884231 -oob -d train.arff -ds
train.info -sl 5 -t 1000 -o crwd_forest
 Test the model:
mahout org.apache.mahout.df.mapreduce.TestForest -i
test.arff -ds train.info -m crwd_forest -a -mr -o
crwd_predictions
13

Summary
Performance
Ease of use
Distributed
Disk-based
Single
Machine
R,
Octave
Mahout
RHadoop
14

Aging MapReduce
 Machine learning algorithms are iterative in nature
 Mahout algorithms involve multiple MapReduce stages
 Intermediate results are written to HDFS
 MR job is launched for each iteration
 IO overhead
Input
Input
HDFS
read
HDFS
read
HDFS
write
HDFS
read
HDFS
write
iter. 1 iter. 2
Query 1
Query 2
Query 3
result 1
result 2
result 3
…
…
Slow due to replication and disk IO
15

Disk Trend
 Disk throughput increasing slowly
Reference - http://www.cs.berkeley.edu/~haoyuan/talks/Tachyon_2013-08-30_AMPCamp2013.pdf
16

Memory Trend
 RAM throughput increasing exponentially
Reference - http://www.cs.berkeley.edu/~haoyuan/talks/Tachyon_2013-08-30_AMPCamp2013.pdf
17

Timeline
2014201320122011
18
2010
Started as
a research
project at
UC
Berkley
AMPLab
2009
Open-
sourced
Accepted
into Apache
incubator
Spark 0.8
release
introduced
MLLib
Spark-MLLib
1.0 released
Wins Best
Paper
Award at
USENIX
NSDI
Spark

Spark – Data sharing
 Resilient Distributed Datasets (RDDs)
– Distributed collections of objects that can be cached in
memory across cluster nodes
– Manipulated through various parallel operations
– Automatically rebuilt on failures
Input
Input
One-time
Processing
iter. 1 iter. 2
Query 1
Query 2
Query 3
…
10-100x faster than network and disk
…
Distributed
memory
19

MLLib
 Spark implementation of some common machine
learning algorithms and utilities, including
– Classification
– Regression
– Clustering
 Pre-packaged libraries (in scala, Java, Python) for
analytics algorithms –
– val model = SVMWithSGD.train(training, numIterations)
– val clusters = KMeans.train(parsedData, numClusters,
numIterations)
20

SparkR - R Interface over Spark
 Currently supports using data transformation
functions lapply() etc. on distributed spark model
 It does not support running out of the box model (e.g.
SVMWithSGD.train or KMeans.train)
 The work is in progress on sparkR - MLLib
integration which may address this limitation
21

Timeline
2014201320122011
22
Began as a
collaboration
between
researchers,
engineers and
data scientists
2010
Initial
release
Started as
a research
project at
UC
Berkley
AMPLab
2009
Open-
sourced
Accepted
into Apache
incubator
Spark 0.8
release
introduced
MLLib
Spark-MLLib
1.0 released
Wins Best
Paper
Award at
USENIX
NSDI
Spark
MADLib
MADLib-
port for
Impala

MADLib
 An open-source library for scalable in-database analytics
 Supports Postgres, Pivotal GreenPlum Database, and
Pivotal HAWQ
 Key MADLib architecture principles are:
– Operating on the data locally-in database.
– Utilizing best of breed database engines, but separate
the machine learning logic from database specific
implementation details.
– Leveraging MPP Share nothing technology, such as the
Pivotal Greenplum Database, to provide parallelism and
scalability.
– Open implementation maintaining active ties into
ongoing academic research."
23

MADLib Architecture
24
User Interface
“Driver” Functions
(outer loops of iterative algorithms, optimizer invocations)
High – level Abstraction Layer
(iteration controller, …)
RDBMS
Built-in
functions
MPP Query Processing
(Greenplum, PostgreSQL, Impala …)
Functions for Inner Loops
(for streaming algorithms)
Low-level Abstraction Layer
(matrix operations, C++ to RDBMS type
bridge, …)
SQL, generated from
specification
C++

Timeline
2014201320122011
25
Began as a
collaboration
between
researchers,
engineers and
data scientists
2010
Initial
release
Started as
a research
project at
UC
Berkley
AMPLab
2009
Open-
sourced
Accepted
into Apache
incubator
Spark 0.8
release
introduced
MLLib
Spark-MLLib
1.0 released
Wins Best
Paper
Award at
USENIX
NSDI
Spark
H2O Project
open-
sourced
MADLib
H2O
Latest stable
release of H2O
2.4.3.4 released
on May 13, 2014
MADLib-
port for
Impala

H2O
 Open source math and prediction engine
 Distributed, in-memory computations
 Creates a cluster of H2O nodes, which are map-
only tasks
 Provides graphical interface to load-data, view
summaries and train models
 Certified for major hadoop distributions
26

H2O on Hadoop Deployment
Hadoop
H2O
Map
Task
Hadoop
H2O
Map
Task
Hadoop
H2O
Map
Task
Job
Tracker
hadoop jar …
HDFS
Hadoop edge Node
Hadoop Cluster
Hadoop Task
Tracker Nodes
(H2O Cluster)
Hadoop HDFS
Data Nodes
27
Reference - http://www.slideshare.net/0xdata/h2o-on-hadoop-dec-12

H2O Programming Interface
 R-Package “H2O”
– prostate.data = h2o.importURL(localH2O, path = “<path>”,
key = “<key>")
– summary(prostate.data)
– h2o.glm
– h2o.kmeans
28

Community involvement
Mahout Spark-MLLib MADLib H2O
# of commits 20 249 0 557
29
For 30 days ending 27 May,

HiveMall
 Machine learning and feature engineering
functions through UDFs/UDAFs/UDTFs of Hive
 Supports various algorithms for –
– Classification – Perceptron, Adaptive Regularization of
Weight Vectors (AROW)
– Regression - Logistic Regression using Stochastic
Gradient Descent
– Recommendation - Minhash (LSH with jaccard index)
– k-Nearest Neighbor
– Feature engineering
30

Summary
Performance
Ease of use
Distributed
Disk-based
Distributed
Memory-based
Single
Machine
R,
Octave
Mahout
RHadoop
H2O
MLLIb
MADLib+
Impala
Hive
Mall
31

MLBase - Vision
 Optimizer built on top of Spark & MLLib
 A Declarative Approach
 Abstracts complexities of variable & algorithm
selection
– var X = load (“als_clinical”, 2 to 10)
– var Y = load (“als_clinical”, 1)
– var (fn-model, summary) = doClassify (X , y)
Reference - http://www.slideshare.net/chaochen5496/mlllib-sparkmeetup8613finalreduced
Train
Model(s)
Gather
Data
Compare
Accuracy
Predict
Future
32

Summary
Performance
Ease of use
Distributed
Disk-based
Distributed
Memory-based
Single
Machine
R,
Octave
Mahout
RHadoop
H2O
MLBase
MLLIb
MADLib+
Impala
Hive
Mall
33

Yes, We Are Hiring!
Thank You!

Open Source Big Data Analytics Toolkits Comparison

Empfohlen

Empfohlen

Weitere ähnliche Inhalte

Was ist angesagt?

Was ist angesagt? (20)

Andere mochten auch

Andere mochten auch (14)

Ähnlich wie Open Source Big Data Analytics Toolkits Comparison

Ähnlich wie Open Source Big Data Analytics Toolkits Comparison (20)

Mehr von DataWorks Summit

Mehr von DataWorks Summit (20)

Kürzlich hochgeladen

Kürzlich hochgeladen (20)

Open Source Big Data Analytics Toolkits Comparison

Hinweis der Redaktion