Big Data Analytics – Open Source Toolkits
Prafulla Wani
Snehalata Deorukhkar
Introduction
• Talk Background
  – More Data Beats Better Algorithms
  – Evaluate “Analytics Toolkits” that support Hadoop
• Speaker Backgrounds
  – Data Engineers
  – No PhDs in statistics
2
Big Data Analytics Toolkits
• Evaluation parameters
  – Ease of use
    • Development APIs
    • Number of algorithms supported
  – Performance
    • Scalable architecture
    • Disk-based / memory-based
• Open-source only
  – RHadoop
  – Mahout
  – MADLib
  – HiveMall
  – H2O
  – Spark-MLLib
3
Analytics Project lifecycle
[Diagram: Gather Data → Train Model(s) → Compare Accuracy → Predict Future]
• Train Algorithm 1 (Logistic regression)
• Train Algorithm 2 (SVM)
• ......
• Train Algorithm N
4
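The lifecycle above (train several models, compare their accuracy, predict with the best one) can be sketched in a few lines of plain Python. This is only an illustration: the two trainers (a majority-class baseline and a nearest-centroid classifier) are simple stand-ins for the logistic regression and SVM named on the slide, and all function names are hypothetical.

```python
def train_majority(rows):
    """Baseline stand-in: always predict the most frequent training label."""
    labels = [y for _, y in rows]
    winner = max(set(labels), key=labels.count)
    return lambda x: winner


def train_nearest_centroid(rows):
    """Stand-in model: predict the label whose feature centroid is closest."""
    sums, counts = {}, {}
    for x, y in rows:
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, xi in enumerate(x):
            acc[i] += xi
        counts[y] = counts.get(y, 0) + 1
    centroids = {y: [s / counts[y] for s in acc] for y, acc in sums.items()}

    def predict(x):
        return min(
            centroids,
            key=lambda y: sum((a - b) ** 2 for a, b in zip(x, centroids[y])),
        )

    return predict


def accuracy(model, rows):
    """Fraction of rows whose label the model predicts correctly."""
    return sum(1 for x, y in rows if model(x) == y) / len(rows)


def pick_best(trainers, train_rows, test_rows):
    """Train every candidate, score it on held-out rows, return the best."""
    scored = []
    for name, trainer in trainers.items():
        model = trainer(train_rows)
        scored.append((accuracy(model, test_rows), name, model))
    best = max(scored, key=lambda t: t[0])
    return best[1], best[2], {name: acc for acc, name, _ in scored}
```

Calling `pick_best` with a dictionary of trainers returns the winning model's name, the model itself (ready for the "Predict Future" step), and the per-model accuracies that were compared.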
Analytics (Pre-Hadoop era)
[Chart: Performance vs. Ease of use. Single Machine: R, Octave]
R
• Started in 1993
• Very popular
• 5589 packages
• Written primarily in C and Fortran
Octave
• Started in 1988
• Open source, with features comparable to Matlab
5
Timeline
[Timeline figure; axis years: 2006, 2008, 2010, 2011, 2012, 2013, 2014]
Hadoop: Core Hadoop (HDFS, MapReduce); HBase, Zookeeper, Pig, Hive…; Avro, Sqoop; Cloudera Impala; YARN
6
Architecture
[Diagram: two deployments side by side, each with a client/edge node and a Hadoop cluster running Map/Reduce]
• RHadoop: R on the client/edge node and on the Hadoop cluster nodes
• Mahout: Mahout on the client/edge node
7
Timeline
[Timeline figure; same axis and Hadoop row as the previous timeline, extended with an RHadoop row]
RHadoop: rhdfs, rmr; rmr 2.0; plyrmr
8
RHadoop
• Provides R packages:
  – rhdfs: read from and write to HDFS
  – rhbase: read from and write to HBase
  – rmr: express MapReduce programs in R
• Does not provide out-of-the-box packages for model training
9
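RHadoop
library(rmr2)  # assumption: the rmr2 package (rmr 2.0) provides mapreduce(), keyval(), from.dfs(), values()

# Logistic regression by gradient ascent; each iteration launches one MapReduce job.
logistic.regression = function(input, iterations, dims, alpha) {
  # Map: gradient contribution of one block of rows; labels Y are in column 1.
  # Defined inside the function so that it can see the current `plane` and `g`.
  lr.map = function(., M) {
    Y = M[, 1]
    X = M[, -1]
    keyval(1, Y * X * g(-Y * as.numeric(X %*% t(plane))))
  }
  # Reduce: sum the contributions column-wise into a single gradient row.
  lr.reduce = function(k, Z)
    keyval(k, t(as.matrix(apply(Z, 2, sum))))

  plane = t(rep(0, dims))            # weights as a 1 x dims matrix
  g = function(z) 1 / (1 + exp(-z))  # sigmoid
  for (i in 1:iterations) {
    gradient = values(from.dfs(mapreduce(
      input,
      map = lr.map,
      reduce = lr.reduce,
      combine = TRUE)))
    plane = plane + alpha * gradient
  }
  plane
}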
10
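For readers without an RHadoop installation, the same map/reduce decomposition of the logistic-regression gradient can be sketched in plain Python. This is an illustration, not the rmr API: the "partitions" are ordinary lists standing in for HDFS blocks, labels are assumed to be -1 or +1 as in the R code, and all names are hypothetical.

```python
import math


def g(z):
    """Sigmoid, as in the R code."""
    return 1.0 / (1.0 + math.exp(-z))


def lr_map(rows, plane):
    """Map step: emit (key, gradient contribution) for each (y, x) row."""
    out = []
    for y, x in rows:
        margin = sum(w * xi for w, xi in zip(plane, x))
        scale = y * g(-y * margin)
        out.append((1, [scale * xi for xi in x]))
    return out


def lr_reduce(pairs):
    """Reduce step: sum all contributions that share the single key."""
    total = None
    for _, vec in pairs:
        total = list(vec) if total is None else [a + b for a, b in zip(total, vec)]
    return total


def logistic_regression(partitions, iterations, dims, alpha):
    """Driver loop: one map/reduce pass over all partitions per iteration."""
    plane = [0.0] * dims
    for _ in range(iterations):
        mapped = [kv for part in partitions for kv in lr_map(part, plane)]
        gradient = lr_reduce(mapped)
        plane = [w + alpha * gj for w, gj in zip(plane, gradient)]
    return plane
```

The driver mirrors the R version: the weights live on the client, and every iteration needs a full pass over the data to produce one summed gradient.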
Timeline
[Timeline figure; same axis, Hadoop row and RHadoop row as the previous timelines, extended with a Mahout row]
Mahout: started as a subproject of Apache Lucene; 4 releases (0.1 – 0.4); top-level Apache project; 0.8 release; decision to reject new MapReduce implementation; future implementations on top of Apache Spark; integration with H2O platform
Callout: recommendation engines are a common case study for Hadoop
11
Mahout
• Original goal: implement all 10 algorithms from Andrew Ng's paper "Map-Reduce for Machine Learning on Multicore"
• Java-based library with MapReduce implementations of common analytics algorithms
• Key algorithms
  – Recommendation algorithms / Collaborative filtering
  – Classification
  – Clustering
  – Frequent Pattern Growth
12
Mahout
• Train the model:
mahout org.apache.mahout.df.mapreduce.BuildForest \
  -Dmapred.max.split.size=1884231 -oob -d train.arff -ds train.info \
  -sl 5 -t 1000 -o crwd_forest
• Test the model:
mahout org.apache.mahout.df.mapreduce.TestForest \
  -i test.arff -ds train.info -m crwd_forest -a -mr -o crwd_predictions
13
Summary
[Chart: Performance vs. Ease of use]
• Single Machine: R, Octave
• Distributed, Disk-based: Mahout, RHadoop
14
Aging MapReduce
• Machine learning algorithms are iterative in nature
• Mahout algorithms involve multiple MapReduce stages
• Intermediate results are written to HDFS
• An MR job is launched for each iteration
• IO overhead
[Diagram, iterative job: Input → HDFS read → iter. 1 → HDFS write → HDFS read → iter. 2 → HDFS write → …]
[Diagram, interactive queries: Input → a separate HDFS read for each of Query 1, Query 2, Query 3, … → result 1, result 2, result 3, …]
Slow due to replication and disk IO
15
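The IO overhead described above can be made concrete with a toy counter. The sketch below is not Hadoop code: `CountingStore` is a hypothetical in-process stand-in for HDFS that only counts whole-dataset reads and writes, and the "algorithm" is a trivial iterative estimate of a mean. It contrasts a job that goes back to storage on every iteration with one that loads the input once and iterates in memory.

```python
class CountingStore:
    """Toy stand-in for HDFS that counts whole-dataset reads and writes."""

    def __init__(self):
        self.blobs = {}
        self.reads = 0
        self.writes = 0

    def write(self, name, value):
        self.writes += 1
        self.blobs[name] = value

    def read(self, name):
        self.reads += 1
        return self.blobs[name]


def step(data, estimate, rate=0.5):
    """One gradient step that moves the estimate toward the mean of data."""
    grad = sum(estimate - x for x in data) / len(data)
    return estimate - rate * grad


def run_disk_based(store, iterations):
    """Every iteration re-reads input and state, then writes the new state."""
    store.write("estimate", 0.0)
    for _ in range(iterations):
        data = store.read("input")
        estimate = store.read("estimate")
        store.write("estimate", step(data, estimate))
    return store.read("estimate")


def run_memory_based(store, iterations):
    """Input is read once; intermediate state never leaves memory."""
    data = store.read("input")
    estimate = 0.0
    for _ in range(iterations):
        estimate = step(data, estimate)
    store.write("estimate", estimate)
    return estimate
```

Both functions return the same estimate; only the read and write counters on the store differ, and the disk-based counts grow linearly with the number of iterations.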
Disk Trend
• Disk throughput increasing slowly
Reference - http://www.cs.berkeley.edu/~haoyuan/talks/Tachyon_2013-08-30_AMPCamp2013.pdf
16
Memory Trend
• RAM throughput increasing exponentially
Reference - http://www.cs.berkeley.edu/~haoyuan/talks/Tachyon_2013-08-30_AMPCamp2013.pdf
17
Timeline
[Timeline figure; axis years: 2009, 2010, 2011, 2012, 2013, 2014]
Spark: started as a research project at UC Berkeley AMPLab; open-sourced; wins Best Paper Award at USENIX NSDI; accepted into Apache incubator; Spark 0.8 release introduced MLLib; Spark-MLLib 1.0 released
18
Spark – Data sharing
• Resilient Distributed Datasets (RDDs)
  – Distributed collections of objects that can be cached in memory across cluster nodes
  – Manipulated through various parallel operations
  – Automatically rebuilt on failures
[Diagram, iterative job: Input → iter. 1 → iter. 2 → …, with data shared through distributed memory]
[Diagram, interactive queries: Input → one-time processing → distributed memory → Query 1, Query 2, Query 3, …]
10-100x faster than network and disk
19
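The two RDD properties listed above, in-memory caching and automatic rebuilding after a failure, can be illustrated with a single-process toy. `ToyRDD` below is a hypothetical class, not the Spark API: it remembers its parent and the function that derives it (its lineage), optionally caches the computed result, and recomputes from lineage when the cached copy is lost.

```python
class ToyRDD:
    """Single-process illustration of caching plus lineage-based rebuild."""

    def __init__(self, source=None, parent=None, fn=None):
        self._source = source      # raw input, only set on a root dataset
        self._parent = parent      # lineage: the dataset this one derives from
        self._fn = fn              # lineage: how to derive it from the parent
        self._cached = None
        self._keep = False
        self.computations = 0      # how many times the data was (re)computed

    def map(self, fn):
        """Describe a derived dataset; nothing is computed yet."""
        return ToyRDD(parent=self, fn=fn)

    def cache(self):
        """Ask for the result to be kept in memory once computed."""
        self._keep = True
        return self

    def collect(self):
        """Return the data, using the cache when it is available."""
        if self._cached is not None:
            return list(self._cached)
        self.computations += 1
        if self._parent is None:
            data = list(self._source)
        else:
            data = [self._fn(x) for x in self._parent.collect()]
        if self._keep:
            self._cached = data
        return list(data)

    def lose_cache(self):
        """Simulate a node failure that drops the in-memory copy."""
        self._cached = None
```

Repeated `collect()` calls on a cached dataset do not recompute it, and after `lose_cache()` the next `collect()` transparently rebuilds the same data from its lineage.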
MLLib
• Spark implementation of some common machine learning algorithms and utilities, including
  – Classification
  – Regression
  – Clustering
• Pre-packaged libraries (in Scala, Java, Python) for analytics algorithms:
  – val model = SVMWithSGD.train(training, numIterations)
  – val clusters = KMeans.train(parsedData, numClusters, numIterations)
20
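To show what the three arguments of a call such as `KMeans.train(parsedData, numClusters, numIterations)` control, here is a minimal single-machine k-means in plain Python. It is a sketch of the basic Lloyd iteration only, not MLLib's implementation: the initialization (simply taking the first `num_clusters` points) is an assumption made for brevity, and the function name is hypothetical.

```python
def kmeans(points, num_clusters, num_iterations):
    """Plain Lloyd iteration: assign points to nearest center, then re-average."""
    # Assumption: naive initialization from the first num_clusters points.
    centers = [list(p) for p in points[:num_clusters]]
    for _ in range(num_iterations):
        groups = [[] for _ in centers]
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            groups[dists.index(min(dists))].append(p)
        for i, grp in enumerate(groups):
            if grp:  # keep the old center if no point was assigned to it
                centers[i] = [sum(col) / len(grp) for col in zip(*grp)]
    return centers
```

Each iteration is one full pass over the data, which is why an in-memory engine benefits this kind of algorithm.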
SparkR - R Interface over Spark
• Currently supports data transformation functions (lapply() etc.) on the distributed Spark model
• Does not support running out-of-the-box models (e.g. SVMWithSGD.train or KMeans.train)
• Work is in progress on SparkR-MLLib integration, which may address this limitation
21
Timeline
[Timeline figure; same axis and Spark row as the previous timeline, extended with a MADLib row]
MADLib: began as a collaboration between researchers, engineers and data scientists; initial release; MADLib port for Impala
22
MADLib
• An open-source library for scalable in-database analytics
• Supports Postgres, Pivotal Greenplum Database, and Pivotal HAWQ
• Key MADLib architecture principles are:
  – Operating on the data locally, in the database.
  – Utilizing best-of-breed database engines, but separating the machine learning logic from database-specific implementation details.
  – Leveraging MPP shared-nothing technology, such as the Pivotal Greenplum Database, to provide parallelism and scalability.
  – Open implementation maintaining active ties to ongoing academic research.
23
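The first principle, operating on the data inside the database, can be illustrated without MADLib itself. The sketch below fits a straight line by ordinary least squares using only SQL aggregates, so the rows never leave the database and only five numbers are returned to the client. SQLite (via Python's standard `sqlite3` module) is used purely as a convenient stand-in, the table and column names are hypothetical, and this is not how MADLib exposes its functions.

```python
import sqlite3


def fit_line_in_database(conn, table):
    """Least-squares fit of y = slope * x + intercept from SQL aggregates.

    `table` is interpolated into the query, so it must be a trusted name.
    """
    n, sx, sy, sxx, sxy = conn.execute(
        f"SELECT COUNT(*), SUM(x), SUM(y), SUM(x * x), SUM(x * y) FROM {table}"
    ).fetchone()
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept


# Hypothetical example data lying exactly on one line.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE points (x REAL, y REAL)")
conn.executemany(
    "INSERT INTO points VALUES (?, ?)", [(0, 1), (1, 3), (2, 5), (3, 7)]
)
slope, intercept = fit_line_in_database(conn, "points")
```

Because the aggregates are sums, a parallel database can compute them per segment and combine the partial results, which is the property the MPP principle above relies on.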
MADLib Architecture
[Diagram: layered architecture, top to bottom]
• User Interface
• “Driver” Functions (outer loops of iterative algorithms, optimizer invocations)
• High-level Abstraction Layer (iteration controller, …)
• Functions for Inner Loops (for streaming algorithms)
• Low-level Abstraction Layer (matrix operations, C++ to RDBMS type bridge, …)
• RDBMS Built-in functions
• MPP Query Processing (Greenplum, PostgreSQL, Impala …)
Side labels: SQL, generated from specification; C++
24
Timeline
[Timeline figure; same axis, Spark row and MADLib row as the previous timelines, extended with an H2O row]
H2O: H2O Project open-sourced; latest stable release of H2O, 2.4.3.4, released on May 13, 2014
25
H2O
• Open source math and prediction engine
• Distributed, in-memory computations
• Creates a cluster of H2O nodes, which are map-only tasks
• Provides a graphical interface to load data, view summaries and train models
• Certified for major Hadoop distributions
26
H2O on Hadoop Deployment
[Diagram: "hadoop jar …" is run on the Hadoop edge node and goes to the Job Tracker; in the Hadoop cluster, each Hadoop Task Tracker node runs an H2O map task, and these tasks together form the H2O cluster; HDFS is provided by the Hadoop HDFS data nodes]
27
Reference - http://www.slideshare.net/0xdata/h2o-on-hadoop-dec-12
H2O Programming Interface
• R package “H2O”
  – prostate.data = h2o.importURL(localH2O, path = "<path>", key = "<key>")
  – summary(prostate.data)
  – h2o.glm
  – h2o.kmeans
28
Community involvement
Commits in the 30 days ending 27 May:
              Mahout  Spark-MLLib  MADLib  H2O
# of commits  20      249          0       557
29
HiveMall
• Machine learning and feature engineering functions through UDFs/UDAFs/UDTFs of Hive
• Supports various algorithms for:
  – Classification: Perceptron, Adaptive Regularization of Weight Vectors (AROW)
  – Regression: Logistic Regression using Stochastic Gradient Descent
  – Recommendation: Minhash (LSH with Jaccard index)
  – k-Nearest Neighbor
  – Feature engineering
30
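Of the techniques listed above, Minhash is the least self-explanatory, so here is a small plain-Python sketch of the idea: the fraction of positions at which two minhash signatures agree estimates the Jaccard index of the underlying sets. This is not HiveMall's UDF; the seeded SHA-1 hash family, the default of 64 hashes, and all function names are assumptions made for illustration.

```python
import hashlib


def _hash(seed, item):
    """One member of a seeded hash family (assumption: SHA-1 of 'seed:item')."""
    digest = hashlib.sha1(f"{seed}:{item}".encode("utf-8")).hexdigest()
    return int(digest, 16)


def minhash_signature(items, num_hashes=64):
    """For each hash function, keep the minimum hash over the (non-empty) set."""
    return [min(_hash(seed, it) for it in items) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


def exact_jaccard(a, b):
    """Exact Jaccard index, for comparison on small sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)
```

Signatures have a fixed length regardless of set size, which is what makes the approach attractive for recommendation over large item sets; more hash functions give a tighter estimate at higher cost.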
Summary
[Chart: Performance vs. Ease of use. Regions: Single Machine; Distributed, Disk-based; Distributed, Memory-based]
Tools plotted: R, Octave, Mahout, RHadoop, H2O, MLLib, MADLib + Impala, HiveMall
31
MLBase - Vision
• Optimizer built on top of Spark & MLLib
• A declarative approach
• Abstracts the complexities of variable & algorithm selection
  – var X = load("als_clinical", 2 to 10)
  – var y = load("als_clinical", 1)
  – var (fn-model, summary) = doClassify(X, y)
Reference - http://www.slideshare.net/chaochen5496/mlllib-sparkmeetup8613finalreduced
[Diagram: Gather Data → Train Model(s) → Compare Accuracy → Predict Future]
32
Summary
[Chart: Performance vs. Ease of use. Regions: Single Machine; Distributed, Disk-based; Distributed, Memory-based]
Tools plotted: R, Octave, Mahout, RHadoop, H2O, MLBase, MLLib, MADLib + Impala, HiveMall
33
Yes, We Are Hiring!
Thank You!

Supporting Apache HBase : Troubleshooting and Supportability ImprovementsDataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureDataWorks Summit
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EngineDataWorks Summit
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudDataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiDataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerDataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouDataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkDataWorks Summit
 

Mehr von DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Kürzlich hochgeladen

Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS:  6 Ways to Automate Your Data IntegrationBridging Between CAD & GIS:  6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integrationmarketing932765
 
Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024TopCSSGallery
 
QCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesQCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesBernd Ruecker
 
Deliver Latency Free Customer Experience
Deliver Latency Free Customer ExperienceDeliver Latency Free Customer Experience
Deliver Latency Free Customer ExperienceOpsTree solutions
 
Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Kaya Weers
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality AssuranceInflectra
 
QMMS Lesson 2 - Using MS Excel Formula.pdf
QMMS Lesson 2 - Using MS Excel Formula.pdfQMMS Lesson 2 - Using MS Excel Formula.pdf
QMMS Lesson 2 - Using MS Excel Formula.pdfROWELL MARQUINA
 
React JS; all concepts. Contains React Features, JSX, functional & Class comp...
React JS; all concepts. Contains React Features, JSX, functional & Class comp...React JS; all concepts. Contains React Features, JSX, functional & Class comp...
React JS; all concepts. Contains React Features, JSX, functional & Class comp...Karmanjay Verma
 
Laying the Data Foundations for Artificial Intelligence!
Laying the Data Foundations for Artificial Intelligence!Laying the Data Foundations for Artificial Intelligence!
Laying the Data Foundations for Artificial Intelligence!Memoori
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Mark Goldstein
 
Glenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security ObservabilityGlenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security Observabilityitnewsafrica
 
Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...
Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...
Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...BookNet Canada
 
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...Jeffrey Haguewood
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentPim van der Noll
 
Accelerating Enterprise Software Engineering with Platformless
Accelerating Enterprise Software Engineering with PlatformlessAccelerating Enterprise Software Engineering with Platformless
Accelerating Enterprise Software Engineering with PlatformlessWSO2
 
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...panagenda
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Nikki Chapple
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfIngrid Airi González
 
All These Sophisticated Attacks, Can We Really Detect Them - PDF
All These Sophisticated Attacks, Can We Really Detect Them - PDFAll These Sophisticated Attacks, Can We Really Detect Them - PDF
All These Sophisticated Attacks, Can We Really Detect Them - PDFMichael Gough
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityIES VE
 

Kürzlich hochgeladen (20)

Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS:  6 Ways to Automate Your Data IntegrationBridging Between CAD & GIS:  6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
 
Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024
 
QCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesQCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architectures
 
Deliver Latency Free Customer Experience
Deliver Latency Free Customer ExperienceDeliver Latency Free Customer Experience
Deliver Latency Free Customer Experience
 
Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
 
QMMS Lesson 2 - Using MS Excel Formula.pdf
QMMS Lesson 2 - Using MS Excel Formula.pdfQMMS Lesson 2 - Using MS Excel Formula.pdf
QMMS Lesson 2 - Using MS Excel Formula.pdf
 
React JS; all concepts. Contains React Features, JSX, functional & Class comp...
React JS; all concepts. Contains React Features, JSX, functional & Class comp...React JS; all concepts. Contains React Features, JSX, functional & Class comp...
React JS; all concepts. Contains React Features, JSX, functional & Class comp...
 
Laying the Data Foundations for Artificial Intelligence!
Laying the Data Foundations for Artificial Intelligence!Laying the Data Foundations for Artificial Intelligence!
Laying the Data Foundations for Artificial Intelligence!
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
 
Glenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security ObservabilityGlenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security Observability
 
Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...
Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...
Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...
 
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
 
Accelerating Enterprise Software Engineering with Platformless
Accelerating Enterprise Software Engineering with PlatformlessAccelerating Enterprise Software Engineering with Platformless
Accelerating Enterprise Software Engineering with Platformless
 
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdf
 
All These Sophisticated Attacks, Can We Really Detect Them - PDF
All These Sophisticated Attacks, Can We Really Detect Them - PDFAll These Sophisticated Attacks, Can We Really Detect Them - PDF
All These Sophisticated Attacks, Can We Really Detect Them - PDF
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a reality
 

Open Source Big Data Analytics Toolkits Comparison

  • 1. Big Data Analytics – Open Source Toolkits. Prafulla Wani, Snehalata Deorukhkar
  • 2. Introduction. Talk background: more data beats better algorithms; evaluate "analytics toolkits" that support Hadoop. Speaker backgrounds: data engineers; no PhDs in statistics.
  • 3. Big Data Analytics Toolkits. Evaluation parameters: ease of use (development APIs, number of algorithms supported) and performance (scalable architecture, disk-based vs. memory-based). Open-source only: RHadoop, Mahout, MADLib, HiveMall, H2O, Spark-MLLib.
  • 4. Analytics Project Lifecycle. Gather data, train model(s), compare accuracy, predict future. Training covers algorithm 1 (logistic regression), algorithm 2 (SVM), ..., algorithm N.
  • 5. Analytics (Pre-Hadoop era). [Diagram: performance vs. ease of use; R and Octave run on a single machine.] R: started in 1993; very popular; 5589 packages; written primarily in C and Fortran. Octave: started in 1988; open source, with features comparable to Matlab.
  • 7. Architecture. [Diagram: with RHadoop, R is present on the client/edge node and on the nodes of the Hadoop cluster; with Mahout, the Mahout library is on the client/edge node and Map/Reduce runs on the Hadoop cluster.]
  • 9. RHadoop. Provides R packages: rhdfs (to read/write from/to HDFS), rhbase (to read/write from/to HBase), rmr (to express MapReduce programs in R). Does not provide out-of-the-box packages for model training.
  • 10. RHadoop
    logistic.regression = function(input, iterations, dims, alpha) {
      plane = t(rep(0, dims))
      g = function(z) 1/(1 + exp(-z))
      for (i in 1:iterations) {
        gradient = values(from.dfs(mapreduce(
          input, map = lr.map, reduce = lr.reduce, combine = T)))
        plane = plane + alpha * gradient }
      plane }
    lr.map = function(., M) {
      Y = M[,1]
      X = M[,-1]
      keyval(1, Y * X * g(-Y * as.numeric(X %*% t(plane)))) }
    lr.reduce = function(k, Z)
      keyval(k, t(as.matrix(apply(Z, 2, sum))))
  • 11. Timeline (2006–2014). Hadoop: core Hadoop (HDFS, MapReduce); HBase, ZooKeeper, Pig, Hive…; Avro, Sqoop; Cloudera Impala; YARN. Mahout: started as a subproject of Apache Lucene; 4 releases (0.1 – 0.4); top-level Apache project; recommendation engines as a common case study for Hadoop; 0.8 release; decision to reject new MapReduce implementations, with future implementations on top of Apache Spark; integration with the H2O platform. RHadoop: rhdfs, rmr; rmr 2.0; plyrmr.
  • 12. Mahout. Original goal: to implement all 10 algorithms from Andrew Ng's paper "Map-Reduce for Machine Learning on Multicore". Java-based library with MapReduce implementations of common analytics algorithms. Key algorithms: recommendation algorithms / collaborative filtering; classification; clustering; frequent pattern growth.
  • 13. Mahout
    Train the model:
      mahout org.apache.mahout.df.mapreduce.BuildForest -Dmapred.max.split.size=1884231 -oob -d train.arff -ds train.info -sl 5 -t 1000 -o crwd_forest
    Test the model:
      mahout org.apache.mahout.df.mapreduce.TestForest -i test.arff -ds train.info -m crwd_forest -a -mr -o crwd_predictions
  • 15. Aging MapReduce. Machine learning algorithms are iterative in nature; Mahout algorithms involve multiple MapReduce stages; intermediate results are written to HDFS; an MR job is launched for each iteration; IO overhead. [Diagram: every iteration and every query reads its input from HDFS, and iterations write their results back to HDFS; slow due to replication and disk IO.]
  • 16. Disk Trend. Disk throughput is increasing slowly. Reference: http://www.cs.berkeley.edu/~haoyuan/talks/Tachyon_2013-08-30_AMPCamp2013.pdf
  • 17. Memory Trend. RAM throughput is increasing exponentially. Reference: http://www.cs.berkeley.edu/~haoyuan/talks/Tachyon_2013-08-30_AMPCamp2013.pdf
  • 18. Timeline (2009–2014). Spark: started as a research project at UC Berkeley AMPLab; open-sourced; wins Best Paper Award at USENIX NSDI; accepted into the Apache incubator; Spark 0.8 release introduced MLLib; Spark-MLLib 1.0 released.
  • 19. Spark – Data sharing. Resilient Distributed Datasets (RDDs): distributed collections of objects that can be cached in memory across cluster nodes; manipulated through various parallel operations; automatically rebuilt on failures. [Diagram: input goes through one-time processing into distributed memory, which then serves the iterations and queries; 10-100x faster than network and disk.]
  • 20. MLLib. Spark implementation of some common machine learning algorithms and utilities, including classification, regression and clustering. Pre-packaged libraries (in Scala, Java, Python) for analytics algorithms:
    val model = SVMWithSGD.train(training, numIterations)
    val clusters = KMeans.train(parsedData, numClusters, numIterations)
  • 21. SparkR – R interface over Spark. Currently supports data transformation functions such as lapply() on the distributed Spark model. It does not support running out-of-the-box models (e.g. SVMWithSGD.train or KMeans.train). Work is in progress on SparkR–MLLib integration, which may address this limitation.
  • 22. Timeline (2009–2014). Spark: milestones as on slide 18. MADLib: began as a collaboration between researchers, engineers and data scientists; initial release; MADLib port for Impala.
  • 23. MADLib. An open-source library for scalable in-database analytics. Supports Postgres, Pivotal Greenplum Database, and Pivotal HAWQ. Key MADLib architecture principles: operating on the data locally, in the database; utilizing best-of-breed database engines, but separating the machine learning logic from database-specific implementation details; leveraging MPP shared-nothing technology, such as the Pivotal Greenplum Database, to provide parallelism and scalability; open implementation maintaining active ties into ongoing academic research.
  • 24. MADLib Architecture. [Diagram components: user interface; "driver" functions (outer loops of iterative algorithms, optimizer invocations); high-level abstraction layer (iteration controller, …); RDBMS built-in functions; MPP query processing (Greenplum, PostgreSQL, Impala …); functions for inner loops (for streaming algorithms); low-level abstraction layer (matrix operations, C++ to RDBMS type bridge, …). Implementation labels: SQL, generated from specification; C++.]
  • 25. Timeline (2009–2014). Spark and MADLib: milestones as on slides 18 and 22. H2O: project open-sourced; latest stable release, H2O 2.4.3.4, released on May 13, 2014.
  • 26. H2O. Open-source math and prediction engine. Distributed, in-memory computations. Creates a cluster of H2O nodes, which are map-only tasks. Provides a graphical interface to load data, view summaries and train models. Certified for major Hadoop distributions.
  • 27. H2O on Hadoop Deployment. [Diagram: "hadoop jar …" is run from the Hadoop edge node and goes to the Job Tracker; each Hadoop Task Tracker node in the cluster runs an H2O map task, and these nodes form the H2O cluster; HDFS sits on the Hadoop HDFS data nodes.] Reference: http://www.slideshare.net/0xdata/h2o-on-hadoop-dec-12
  • 28. H2O Programming Interface. R package "H2O":
    prostate.data = h2o.importURL(localH2O, path = "<path>", key = "<key>")
    summary(prostate.data)
    h2o.glm
    h2o.kmeans
  • 29. Community involvement. Number of commits for the 30 days ending 27 May: Mahout 20; Spark-MLLib 249; MADLib 0; H2O 557.
  • 30. HiveMall. Machine learning and feature engineering functions through UDFs/UDAFs/UDTFs of Hive. Supports various algorithms for: classification (Perceptron, Adaptive Regularization of Weight Vectors (AROW)); regression (logistic regression using stochastic gradient descent); recommendation (MinHash, i.e. LSH with Jaccard index); k-nearest neighbor; feature engineering.
  • 32. MLBase – Vision. Optimizer built on top of Spark and MLLib. A declarative approach. Abstracts the complexities of variable and algorithm selection:
    var X = load("als_clinical", 2 to 10)
    var y = load("als_clinical", 1)
    var (fn-model, summary) = doClassify(X, y)
    Reference: http://www.slideshare.net/chaochen5496/mlllib-sparkmeetup8613finalreduced
    [Diagram: the analytics project lifecycle from slide 4 (gather data, train model(s), compare accuracy, predict future).]
  • 34. Yes, We Are Hiring! Thank You!
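
The RHadoop listing on slide 10 can be sketched in plain Python to make the map and reduce roles explicit. This is a minimal single-process sketch under stated assumptions, not the authors' code: the names lr_map, lr_reduce and logistic_regression mirror the slide, labels are assumed to be -1 or +1 in the first column of each row, and an in-memory list of rows stands in for the HDFS input that mapreduce() would read.

```python
import math


def sigmoid(z):
    """Logistic function, the g(z) of the slide."""
    return 1.0 / (1.0 + math.exp(-z))


def lr_map(row, plane):
    """Map step: one row's contribution to the gradient.

    row is (y, x1, ..., xd) with y assumed to be -1 or +1.
    """
    y, x = row[0], row[1:]
    margin = y * sum(w * xi for w, xi in zip(plane, x))
    scale = y * sigmoid(-margin)
    return [scale * xi for xi in x]


def lr_reduce(partials):
    """Reduce step: column-wise sum of all per-row contributions."""
    return [sum(column) for column in zip(*partials)]


def logistic_regression(rows, iterations, dims, alpha):
    """One full map/reduce pass over the data per iteration."""
    plane = [0.0] * dims
    for _ in range(iterations):
        gradient = lr_reduce(lr_map(row, plane) for row in rows)
        plane = [w + alpha * g for w, g in zip(plane, gradient)]
    return plane
```

Each pass of the loop corresponds to one MapReduce job in the RHadoop version, which is why the number of iterations matters so much for run time on a cluster.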

Editor's notes

  1. Gather data: exploratory analytics, variable selection, dimensionality reduction (PCA, SVD). Train model. Compare model performance (AUC curve etc.). Predict the future.
  2. Now let us understand how analytics was done in the pre-Hadoop era. Tools like R and Octave were widely used; they run on a single machine and perform well enough on small datasets. Both R and Octave are open-source, high-level interpreted languages. R started in 1993 and is mainly written in C and Fortran. It is very popular among statisticians and data scientists for computational statistics, visualization and data science, and it has a vibrant community noted for its active contributions in terms of packages: it has 5589 packages. The Octave language is quite similar to Matlab, so most programs are easily portable. But both languages are limited in the volume of data they can handle and are not suitable for analytics on huge and dynamic data sets. Hadoop is a de facto standard for storing and processing huge volumes of data.
  3. Hadoop was started by Doug Cutting for the Nutch project at Yahoo. Until 2007 it had two core components: HDFS and MapReduce. In 2008, tools like HBase and ZooKeeper were added to the Hadoop ecosystem. In 2010 Avro and Sqoop were added, and the ecosystem is still growing. Two main tools, RHadoop and Mahout, were developed to leverage the distributed processing of the Hadoop framework. The introduction of YARN opens the Hadoop framework to many other frameworks beyond MapReduce. RHadoop? RHIPE? 2012?
  4. RHadoop is an open-source collection of three R packages that allow users to manage and analyze data with Hadoop from the R environment. R, along with the RHadoop or RHIPE packages, needs to be installed on all the nodes, including the edge node, and RHadoop/RHIPE submits the job from the client/edge node. Mahout is a Java library with MapReduce implementations of machine learning algorithms. In the case of Mahout, only the Mahout library needs to be present on the client/edge node; the Mahout job, which is an MR job for distributed algorithms, is submitted from there to the Hadoop cluster.
  5. RHadoop? RHIPE? 2012? plyrmr provides additional data manipulation capabilities.
  6. RHadoop consists of the following packages: rmr2 (functions providing Hadoop MapReduce functionality in R); rhdfs (functions providing file management of HDFS from within R); rhbase (functions providing database management for the HBase distributed database from within R).
  7. This is sample code for logistic regression in RHadoop. The logistic regression available in R cannot be reused; the algorithm has to be rewritten as map and reduce functions.
  8. RHadoop? RHIPE? 2012? We saw adoption of Mahout-based recommendation engines across the industry…
  9. Mahout is a Java library with MR implementations of common machine learning algorithms. It was developed to provide scalable, parallelized machine learning algorithms based on the Hadoop framework. The original aim of the Mahout project was to implement all 10 algorithms discussed in Andrew Ng’s paper “Mapreduce …. “
  10. One of the reasons MapReduce is criticized is its restricted programming framework: • MapReduce tasks must be written as acyclic dataflow programs: a stateless mapper followed by a stateless reducer, executed by a batch job scheduler • Repeated querying of a dataset becomes difficult, so iterative algorithms are hard to write • After each MapReduce iteration, the data has to be persisted to disk before the next iteration can proceed.
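To make the iteration problem in note 10 concrete, here is a minimal sketch in plain Python. It does not use any Hadoop API; it only simulates the pattern locally. The names `map_gradient`, `reduce_sum`, `run_job`, `train` and `read_input`, and the toy records, are hypothetical and chosen for illustration. Each pass of gradient-ascent logistic regression is a separate "job" that re-reads the whole input, and the only state carried between passes is the weight vector held by the driver.

```python
import math

def map_gradient(record, weights):
    """Stateless mapper: one record's contribution to the gradient."""
    label, features = record
    z = sum(w * x for w, x in zip(weights, features))
    prediction = 1.0 / (1.0 + math.exp(-z))
    return [(label - prediction) * x for x in features]

def reduce_sum(partials):
    """Stateless reducer: element-wise sum of the per-record vectors."""
    return [sum(column) for column in zip(*partials)]

def run_job(read_input, weights):
    """One simulated batch job: scan the full input, map, then reduce."""
    return reduce_sum([map_gradient(r, weights) for r in read_input()])

def train(read_input, dims, iterations, alpha):
    """Driver loop: every iteration launches a fresh job over the input."""
    weights = [0.0] * dims
    for _ in range(iterations):
        gradient = run_job(read_input, weights)
        weights = [w + alpha * g for w, g in zip(weights, gradient)]
    return weights

# Toy data (label, features); the first feature is a constant bias term.
RECORDS = [(1, [1.0, 2.0]), (0, [1.0, -2.0]), (1, [1.0, 1.5]), (0, [1.0, -1.0])]
scans = {"count": 0}

def read_input():
    """Stands in for reading the dataset from disk; counts full scans."""
    scans["count"] += 1
    return list(RECORDS)

weights = train(read_input, dims=2, iterations=5, alpha=0.1)
print("full input scans:", scans["count"])
print("weights:", weights)
```

In this sketch the number of full input scans equals the number of iterations. On a real cluster each of those scans would be a disk read plus a job launch, which is the cost that memory-based engines try to avoid.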
  11. MADlib grew out of discussions between database engine developers, data scientists, IT architects, and academics interested in new approaches to scalable, sophisticated in-database analytics. Their exchanges were written up in a paper at VLDB 2009 that coined the term "MAD Skills" for data analysis. The MADlib project began in 2010 as a collaboration between researchers at UC Berkeley and engineers and data scientists at Pivotal (formerly Greenplum); today it also includes researchers from Stanford and the University of Florida. The latest version is 1.5. MADlib’s initial release included Naive Bayes, k-means, SVM, quantile, linear and logistic regression, and matrix factorization.
  12. Spark started as a research project at the UC Berkeley AMPLab in 2009 and was open sourced in early 2010. The AMPLab continues to do research both on improving Spark and on systems built on top of it. After being released, Spark grew a developer community on GitHub and entered Apache in 2013 as its permanent home. A wide range of contributors now develop the project (over 120 developers from 25 companies). MLlib is developed as part of the Apache Spark project, so it is tested and updated with each Spark release. Spark became a top-level Apache project in February 2014. Current version: 1.0. Included algorithms: SVM, logistic regression, k-means, ALS. Spark also supports Hadoop YARN.
  14. MADlib grew out of discussions between database-engine developers, data scientists, IT architects and academics interested in new approaches to scalable, sophisticated in-database analytics. These discussions were written up in a paper at VLDB 2009 that coined the term “MAD Skills” for data analysis. The MADlib software project began the following year as a collaboration between researchers at UC Berkeley and engineers and data scientists at EMC/Greenplum (later Pivotal). Today it also includes researchers from Stanford and the University of Florida. Latest version: 1.5. Algorithms supported: • Classification: Naive Bayes classification, Random Forest • Regression: logistic regression, linear regression, multinomial logistic regression, elastic net regularization • Clustering: k-means • Topic modeling: Latent Dirichlet Allocation, etc. • Association rule mining: Apriori