The document summarizes several open source big data analytics toolkits that support Hadoop, including RHadoop, Mahout, MADLib, HiveMall, H2O, and Spark-MLLib. It describes each toolkit's key features, such as algorithms supported, performance, ease of use, and architecture. Spark-MLLib provides machine learning algorithms in a distributed in-memory framework for improved performance over disk-based approaches. MADLib operates on data directly in-database, and H2O performs distributed in-memory computation for faster iterative modeling.
Open Source Big Data Analytics Toolkits Comparison
1. Big Data Analytics – Open Source Toolkits
Prafulla Wani
Snehalata Deorukhkar
2. Introduction
Talk Background
– More Data Beats Better Algorithms
– Evaluate “Analytics Toolkits” that support Hadoop
Speaker Backgrounds
– Data Engineers
– No PhDs in statistics
3. Big Data Analytics Toolkits
Evaluation parameters
– Ease of use
• Development APIs
• # of Algorithms supported
– Performance
• Scalable Architecture
• Disk-based / Memory-based
Open-source only
– RHadoop
– Mahout
– MADLib
– HiveMall
– H2O
– Spark-MLLib
9. RHadoop
Provides R packages –
– rhdfs - to read/write from/to HDFS
– rhbase - to read/write from/to HBase
– rmr - to express map-reduce programs in R
Does not provide out-of-the-box packages for model training
10. RHadoop
logistic.regression = function(input, iterations, dims, alpha) {
  plane = t(rep(0, dims))                 # initialize the separating hyperplane
  g = function(z) 1 / (1 + exp(-z))       # logistic (sigmoid) function
  for (i in 1:iterations) {
    # each iteration runs one MapReduce job and collects the summed gradient
    gradient =
      values(from.dfs(mapreduce(
        input,
        map = lr.map,
        reduce = lr.reduce,
        combine = TRUE)))
    plane = plane + alpha * gradient }    # gradient-ascent update
  plane }

lr.map =
  function(., M) {
    Y = M[, 1]                            # labels in the first column
    X = M[, -1]                           # features in the remaining columns
    # emit each point's gradient contribution under a single key;
    # `plane` comes from the enclosing environment, shipped to mappers by rmr
    keyval(
      1,
      Y * X *
        g(-Y * as.numeric(X %*% t(plane))))}

lr.reduce =
  function(k, Z)
    keyval(k, t(as.matrix(apply(Z, 2, sum))))   # sum contributions per key
11. Timeline
2006 – Core Hadoop (HDFS, MapReduce); Mahout started as a subproject of Apache Lucene
2008 – HBase, ZooKeeper, Pig, Hive… join the Hadoop ecosystem
2010 – Mahout becomes a top-level Apache project, with 4 releases (0.1 – 0.4); Avro and Sqoop added
2011 – RHadoop released (rhdfs, rmr)
2012 – YARN; Cloudera Impala; rmr 2.0
2013 – Mahout 0.8 release; recommendation engines become a common case study for Hadoop; plyrmr
2014 – Mahout decides to reject new MapReduce implementations; future implementations on top of Apache Spark; integration with the H2O platform
12. Mahout
Original goal – to implement all 10 algorithms from Andrew Ng's paper "Map-Reduce for Machine Learning on Multicore"
Java-based library with MapReduce implementations of common analytics algorithms
Key algorithms
– Recommendation algorithms / Collaborative filtering
– Classification
– Clustering
– Frequent Pattern Growth
13. Mahout
Train the model:
mahout org.apache.mahout.df.mapreduce.BuildForest -Dmapred.max.split.size=1884231 -oob -d train.arff -ds train.info -sl 5 -t 1000 -o crwd_forest
Test the model:
mahout org.apache.mahout.df.mapreduce.TestForest -i test.arff -ds train.info -m crwd_forest -a -mr -o crwd_predictions
15. Aging MapReduce
Machine learning algorithms are iterative in nature
Mahout algorithms involve multiple MapReduce stages
Intermediate results are written to HDFS
MR job is launched for each iteration
IO overhead
[Diagram: each iteration reads its input from HDFS and writes its result back to HDFS before the next iteration begins; likewise each query re-reads the data from HDFS. Slow due to replication and disk IO.]
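To make this concrete, below is a rough Scala sketch (ours, not from the deck) of the driver loop an iterative algorithm is forced into on plain MapReduce; runMapReduceJob is a hypothetical stand-in for submitting one MR job, and the HDFS paths are illustrative only.

  // Sketch only: one full batch job per iteration, with the model state
  // round-tripped through HDFS every time.
  // runMapReduceJob is a hypothetical placeholder, not a real Hadoop API.
  def runMapReduceJob(input: String, model: String, output: String): Unit = ???

  def iterate(inputPath: String, iterations: Int): String = {
    var modelPath = "hdfs:///model/iter-0"   // illustrative initial model location
    for (i <- 1 to iterations) {
      val outPath = s"hdfs:///model/iter-$i"
      // Each call pays job-launch cost, a full input re-read from HDFS,
      // and a replicated HDFS write of the intermediate result.
      runMapReduceJob(inputPath, modelPath, outPath)
      modelPath = outPath                    // next iteration re-reads this from disk
    }
    modelPath
  }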
16. Disk Trend
Disk throughput increasing slowly
Reference - http://www.cs.berkeley.edu/~haoyuan/talks/Tachyon_2013-08-30_AMPCamp2013.pdf
19. Spark – Data sharing
Resilient Distributed Datasets (RDDs)
– Distributed collections of objects that can be cached in memory across cluster nodes
– Manipulated through various parallel operations
– Automatically rebuilt on failures
[Diagram: input is processed once into distributed memory; subsequent iterations and queries then read from memory – 10-100x faster than network and disk.]
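As a minimal Scala sketch of this data-sharing model (ours, not from the deck; the input path is illustrative), caching an RDD once lets every later iteration read it from cluster memory instead of HDFS:

  import org.apache.spark.{SparkConf, SparkContext}

  object RddCachingSketch {
    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(new SparkConf().setAppName("RddCachingSketch"))
      // One-time processing: parse the input and pin it in distributed memory.
      val points = sc.textFile("hdfs:///data/points.txt")
        .map(_.split(' ').map(_.toDouble))
        .cache()
      // Iterations now read from memory; no HDFS round-trip between them.
      var total = 0.0
      for (i <- 1 to 10) {
        total += points.map(_.sum).reduce(_ + _)
      }
      println(total)
      sc.stop()
    }
  }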
20. MLLib
Spark implementation of some common machine learning algorithms and utilities, including
– Classification
– Regression
– Clustering
Pre-packaged libraries (in Scala, Java, Python) for analytics algorithms –
– val model = SVMWithSGD.train(training, numIterations)
– val clusters = KMeans.train(parsedData, numClusters, numIterations)
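Expanding the SVM one-liner into a fuller, runnable sketch against the Spark 1.0-era MLlib API (the sample data file ships with Spark; the variable names here are ours):

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.mllib.classification.SVMWithSGD
  import org.apache.spark.mllib.util.MLUtils

  object SvmSketch {
    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(new SparkConf().setAppName("SvmSketch"))
      // Load labeled points in LIBSVM format and split into train/test sets.
      val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
      val Array(training, test) = data.randomSplit(Array(0.6, 0.4), seed = 11L)
      training.cache()   // SGD is iterative and re-reads the training set each pass
      // Train a linear SVM with stochastic gradient descent.
      val numIterations = 100
      val model = SVMWithSGD.train(training, numIterations)
      // Score the held-out set as (predicted label, true label) pairs.
      val scoreAndLabels = test.map(p => (model.predict(p.features), p.label))
      scoreAndLabels.take(5).foreach(println)
      sc.stop()
    }
  }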
21. SparkR - R Interface over Spark
Currently supports data transformation functions such as lapply() on distributed Spark datasets
It does not yet support running the out-of-the-box MLLib models (e.g. SVMWithSGD.train or KMeans.train)
Work is in progress on SparkR–MLLib integration, which may address this limitation
22. Timeline
Spark
– 2009: Started as a research project at UC Berkeley AMPLab
– 2010: Open-sourced
– 2012: Wins Best Paper Award at USENIX NSDI
– 2013: Accepted into the Apache incubator; Spark 0.8 release introduced MLLib
– 2014: Spark-MLLib 1.0 released
MADLib
– 2010: Began as a collaboration between researchers, engineers and data scientists
– 2011: Initial release
– 2014: MADLib port for Impala
23. MADLib
An open-source library for scalable in-database analytics
Supports PostgreSQL, Pivotal Greenplum Database, and Pivotal HAWQ
Key MADLib architecture principles:
– Operating on the data locally, in-database
– Utilizing best-of-breed database engines, but separating the machine learning logic from database-specific implementation details
– Leveraging MPP shared-nothing technology, such as the Pivotal Greenplum Database, to provide parallelism and scalability
– Open implementation maintaining active ties into ongoing academic research
24. MADLib Architecture
– User Interface
– "Driver" Functions (outer loops of iterative algorithms, optimizer invocations) – SQL, generated from specification
– High-level Abstraction Layer (iteration controller, …)
– Functions for Inner Loops (for streaming algorithms) – C++
– Low-level Abstraction Layer (matrix operations, C++ to RDBMS type bridge, …) – C++
– RDBMS: built-in functions; MPP Query Processing (Greenplum, PostgreSQL, Impala …)
25. Timeline
Spark and MADLib milestones repeat from the previous timeline, plus:
H2O
– H2O project open-sourced
– Latest stable release of H2O, 2.4.3.4, released on May 13, 2014
26. H2O
Open source math and prediction engine
Distributed, in-memory computations
Creates a cluster of H2O nodes, which run as map-only tasks
Provides a graphical interface to load data, view summaries, and train models
Certified for major Hadoop distributions
32. MLBase - Vision
Optimizer built on top of Spark & MLLib
A Declarative Approach
Abstracts complexities of variable & algorithm selection
– var X = load (“als_clinical”, 2 to 10)
– var Y = load (“als_clinical”, 1)
– var (fn-model, summary) = doClassify(X, y)
Reference - http://www.slideshare.net/chaochen5496/mlllib-sparkmeetup8613finalreduced
[Diagram: Gather Data → Train Model(s) → Compare Accuracy → Predict Future]
Gather Data – exploratory analytics, variable selection, dimensionality reduction (PCA, SVD)
Train Model(s)
Compare Model Performance – AUC curve etc.
Predict the future
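A minimal sketch of the "compare model performance" step using MLlib's BinaryClassificationMetrics (assuming the model and test set from the SVM sketch earlier; this is our illustration, not from the slides):

  import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics

  model.clearThreshold()   // make predict() return raw scores rather than 0/1 labels
  val scoreAndLabels = test.map(p => (model.predict(p.features), p.label))
  val metrics = new BinaryClassificationMetrics(scoreAndLabels)
  println(s"Area under ROC = ${metrics.areaUnderROC()}")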
Now let us understand how analytics was done in the pre-Hadoop era. Tools like R and Octave were widely used; they run on a single machine and give fair performance on small datasets. Both R and Octave are open source, high-level interpreted languages. R started in 1993 and is mainly written in C and Fortran. R is a very popular tool among statisticians and data scientists for computational statistics, visualization, and data science. It has a vibrant community noted for its active contributions in terms of packages – 5589 packages to date.
Octave is also an open source, high-level interpreted language. The Octave language is quite similar to MATLAB, so most programs are easily portable.
But both of these languages are limited in the volume of data they can handle and are not suitable for analytics on huge and dynamic data sets. Hadoop is the de facto standard for storing and processing huge volumes of data.
Hadoop was started by Doug Cutting for the Nutch project at Yahoo. Until 2007 it had two core components – HDFS and MapReduce. In 2008, tools like HBase and ZooKeeper were added to the Hadoop ecosystem. In 2010 Avro and Sqoop were added, and the ecosystem is still growing.
Two main tools – RHadoop and Mahout – were developed to leverage the distributed processing of the Hadoop framework.
The introduction of YARN opens the Hadoop framework to many other processing frameworks beyond MapReduce.
RHadoop is an open source collection of three R packages that allow users to manage and analyze data with Hadoop from the R environment.
R, along with the RHadoop packages, needs to be installed on all the nodes, including the edge node; RHadoop then submits the job from the client/edge node.
Mahout is a Java library with MapReduce implementations of machine learning algorithms. In Mahout's case, only the Mahout library needs to be present on the client/edge node; the submitted Mahout job runs as an MR job on the Hadoop cluster for distributed algorithms.
RHadoop consists of the following packages:
• rmr2 – functions providing Hadoop MapReduce functionality in R
• rhdfs – functions providing file management of HDFS from within R
• rhbase – functions providing database management for the HBase distributed database from within R
This is sample code for logistic regression in RHadoop; the logistic regression implementation available in standard R cannot be reused.
We saw adoption of Mahout-based recommendation engines across the industry.
Mahout is a Java library with MapReduce implementations of common machine learning algorithms. It was developed to provide scalable, parallelized machine learning algorithms on the Hadoop framework. The original aim of the Mahout project was to implement all 10 algorithms discussed in Andrew Ng's paper "Map-Reduce for Machine Learning on Multicore".
One of the reasons why MapReduce is criticized is its restricted programming framework:
- MapReduce tasks must be written as acyclic dataflow programs
- A stateless mapper is followed by a stateless reducer, executed by a batch job scheduler
- Repeated querying of datasets becomes difficult, making it hard to write iterative algorithms
- After each MapReduce iteration, data has to be persisted to disk for the next iteration to proceed with processing
MADlib grew out of discussions between database engine developers, data scientists, IT architects, and academics interested in new approaches to scalable, sophisticated in-database analytics. Their exchanges were written up in a paper in VLDB 2009 that coined the term "MAD Skills" for data analysis.
The MADlib project began in 2010 as a collaboration between researchers at UC Berkeley and engineers and data scientists at EMC/Greenplum (later Pivotal); today it also includes researchers from Stanford and the University of Florida.
Latest version: 1.5
MADlib's initial release included: Naive Bayes, k-means, SVM, quantile, linear and logistic regression, and matrix factorization.
Spark started as a research project at the UC Berkeley AMPLab in 2009 and was open sourced in early 2010. The AMPLab continues to perform research on both improving Spark and on systems built on top of it.
After being released, Spark grew a developer community on GitHub and entered Apache in 2013 as its permanent home. A wide range of contributors now develop the project (over 120 developers from 25 companies). MLlib is developed as part of the Apache Spark project; it thus gets tested and updated with each Spark release.
Spark became a top-level Apache project in February 2014.
Current version: 1.0
MLlib initially included SVM, logistic regression, k-means, and ALS.
Hadoop YARN support in Spark.
Algorithms Supported
– Classification: Naive Bayes, Random Forest
– Regression: Logistic Regression, Linear Regression, Multinomial Logistic Regression, Elastic Net Regularization
– Clustering: k-Means
– Topic Modeling: Latent Dirichlet Allocation etc.
– Association Rule Mining: Apriori