SlideShare ist ein Scribd-Unternehmen logo
1 von 55
Downloaden Sie, um offline zu lesen
Recent Developments in
Spark MLlib and Beyond
Xiangrui Meng
What is MLlib?
What is MLlib?
MLlib is an Apache Spark component focusing on
machine learning:!
• initial contribution from AMPLab, UC Berkeley
• shipped with Spark since version 0.8 (Sep 2013)
• 40 contributors
3
What is MLlib?
Algorithms:!
• classification: logistic regression, linear support vector machine
(SVM), naive Bayes, classification tree
• regression: generalized linear models (GLMs), regression tree
• collaborative filtering: alternating least squares (ALS), non-negative
matrix factorization (NMF)
• clustering: k-means
• decomposition: singular value decomposition (SVD), principal
component analysis (PCA)
4
Why MLlib?
scikit-learn
Algorithms:!
• classification: SVM, nearest neighbors, random forest, …
• regression: support vector regression (SVR), ridge
regression, Lasso, logistic regression, …!
• clustering: k-means, spectral clustering, …
• decomposition: PCA, non-negative matrix factorization
(NMF), independent component analysis (ICA), …
6
It is built on Apache Spark, a fast and general engine
for large-scale data processing.
• Run programs up to 100x faster than 

Hadoop MapReduce in memory, or 

10x faster on disk.
• Write applications quickly in Java, Scala, or Python.
Why MLlib?
7
Spark philosophy
Make life easy and productive for data scientists:
• well documented, expressive APIs
• powerful domain specific libraries
• easy integration with storage systems
• … and caching to avoid data movement
8
Word count (Scala)
val counts = sc.textFile("hdfs://...")
.flatMap(line => line.split(" "))
.map(word => (word, 1L))
.reduceByKey(_ + _)
9
Gradient descent
val points = spark.textFile(...).map(parsePoint).cache()
var w = Vector.zeros(d)
for (i <- 1 to numIterations) {
  val gradient = points.map { p =>
    (1 / (1 + exp(-p.y * w.dot(p.x)) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= alpha * gradient
}
w w ↵ ·
nX
i=1
g(w; xi, yi)
10
K-means (scala)
// Load and parse the data.
val data = sc.textFile("kmeans_data.txt")
val parsedData = data.map(_.split(‘ ').map(_.toDouble)).cache()
!
// Cluster the data into two classes using KMeans.
val clusters = KMeans.train(parsedData, 2, numIterations = 20)
!
// Compute the sum of squared errors.
val cost = clusters.computeCost(parsedData)
println("Sum of squared errors = " + cost)
11
K-means (python)
# Load and parse the data
data = sc.textFile("kmeans_data.txt")
parsedData = data.map(lambda line:
array([float(x) for x in line.split(' ‘)])).cache()
!
# Build the model (cluster the data)
clusters = KMeans.train(parsedData, 2, maxIterations = 10,
runs = 1, initialization_mode = "kmeans||")
!
# Evaluate clustering by computing the sum of squared errors
def error(point):
center = clusters.centers[clusters.predict(point)]
return sqrt(sum([x**2 for x in (point - center)]))
!
cost = parsedData.map(lambda point: error(point))
.reduce(lambda x, y: x + y)
print("Sum of squared error = " + str(cost))
12
Dimension reduction

+ k-means
// compute principal components
val points: RDD[Vector] = ...
val mat = RowMatrix(points)
val pc = mat.computePrincipalComponents(20)
!
// project points to a low-dimensional space
val projected = mat.multiply(pc).rows
!
// train a k-means model on the projected data
val model = KMeans.train(projected, 10)
13
Collaborative filtering
// Load and parse the data
val data = sc.textFile("mllib/data/als/test.data")
val ratings = data.map(_.split(',') match {
case Array(user, item, rate) =>
Rating(user.toInt, item.toInt, rate.toDouble)
})
!
// Build the recommendation model using ALS
val numIterations = 20
val model = ALS.train(ratings, 1, 20, 0.01)
!
// Evaluate the model on rating data
val usersProducts = ratings.map { case Rating(user, product, rate) =>
(user, product)
}
val predictions = model.predict(usersProducts)
14
Why MLlib?
It ships with Spark as 

a standard component.
15
Community
Spark community
One of the largest open source projects in big data:!
• 190+ developers contributing (40 for MLlib)
• 50+ companies contributing
• 400+ discussions per month on the mailing list
17
Spark community
The most active project in the Hadoop ecosystem:
JIRA 30 day summary
Spark Map/Reduce YARN
created 296 30 100
resolved 203 28 69
18
Unification
Out for dinner?!
• Search for a restaurant and make a reservation.
• Start navigation.
• Food looks good? Take a photo and share.
20
Why smartphone?
Out for dinner?!
• Search for a restaurant and make a reservation. (Yellow Pages?)
• Start navigation. (GPS?)
• Food looks good? Take a photo and share. (Camera?)
21
Why MLlib?
A special-purpose device may be better at one
aspect than a general-purpose device. But the cost
of context switching is high:
• different languages or APIs
• different data formats
• different tuning tricks
22
Spark SQL + MLlib
// Data can easily be extracted from existing sources,
// such as Apache Hive.
val trainingTable = sql("""
SELECT e.action,
u.age,
u.latitude,
u.longitude
FROM Users u
JOIN Events e
ON u.userId = e.userId""")
!
// Since `sql` returns an RDD, the results of the above
// query can be easily used in MLlib.
val training = trainingTable.map { row =>
val features = Vectors.dense(row(1), row(2), row(3))
LabeledPoint(row(0), features)
}
!
val model = SVMWithSGD.train(training)
23
Streaming + MLlib
// collect tweets using streaming
!
// train a k-means model
val model: KMmeansModel = ...
!
// apply model to filter tweets
val tweets = TwitterUtils.createStream(ssc, Some(authorizations(0)))
val statuses = tweets.map(_.getText)
val filteredTweets =
statuses.filter(t => model.predict(featurize(t)) == clusterNumber)
!
// print tweets within this particular cluster
filteredTweets.print()
24
GraphX + MLlib
// assemble link graph
val graph = Graph(pages, links)
val pageRank: RDD[(Long, Double)] = graph.staticPageRank(10).vertices
!
// load page labels (spam or not) and content features
val labelAndFeatures: RDD[(Long, (Double, Seq((Int, Double)))] = ...
val training: RDD[LabeledPoint] =
labelAndFeatures.join(pageRank).map {
case (id, ((label, features), pageRank)) =>
LabeledPoint(label, Vectors.sparse(features ++ (1000, pageRank))
}
!
// train a spam detector using logistic regression
val model = LogisticRegressionWithSGD.train(training)
25
Why MLlib?
• built on Spark’s lightweight yet powerful APIs
• Spark’s open source community
• seamless integration with Spark’s other components
26
• comparable to or even better than other libraries
specialized in large-scale machine learning
Why MLlib?
In MLlib’s implementations, we care about!
• scalability
• performance
• user-friendly documentation and APIs
• cost of maintenance
27
v1.0
What’s new in MLlib
• new user guide and code examples
• API stability
• sparse data support
• regression and classification tree
• distributed matrices
• tall-and-skinny PCA and SVD
• L-BFGS
• binary classification model evaluation
29
MLlib user guide
• v0.9 (link)
• a single giant page
• mixed code snippets in different languages
• v1.0 (link)
• The content is more organized.
• Tabs are used to separate code snippets in different
languages.
• Unified API docs.
30
Code examples
Code examples that can be used as templates to
build standalone applications are included under the
“examples/” folder with sample datasets:
• binary classification (linear SVM and logistic regression)
• decision tree
• k-means
• linear regression
• naive Bayes
• tall-and-skinny PCA and SVD
• collaborative filtering
31
Code examples
MovieLensALS: an example app for ALS on MovieLens data.
Usage: MovieLensALS [options] <input>
!
--rank <value>
rank, default: 10
--numIterations <value>
number of iterations, default: 20
--lambda <value>
lambda (smoothing constant), default: 1.0
--kryo
use Kryo serialization
--implicitPrefs
use implicit preference
<input>
input paths to a MovieLens dataset of ratings
32
API stability
• Following Spark core, MLlib is guaranteeing binary
compatibility for all 1.x releases on stable APIs.
• For changes in experimental and developer APIs,
we will provide migration guide between releases.
33
Sparse data support
“large-scale sparse problems”
Sparse datasets appear almost everywhere in the
world of big data, where the sparsity may come from
many sources, e.g.,
• feature transformation: 

one-hot encoding, interaction, and bucketing,
• large feature space: n-grams,
• missing data: rating matrix,
• low-rank structure: images and signals.
35
Sparsity is almost everywhere
The Netflix Prize:!
• number of users: 480,189
• number of movies: 17,770
• number of observed ratings: 100,480,507
• sparsity = 1.17%
36
Sparsity is almost everywhere
rcv1.binary (test)!
• number of examples: 677,399
• number of features: 47,236
• sparsity: 0.15%
• storage: 270GB (dense) or 600MB (sparse)
37
Exploiting sparsity
• In Spark v1.0, MLlib adds support for sparse input
data in Scala, Java, and Python.
• MLlib takes advantage of sparsity in both storage
and computation in
• linear methods (linear SVM, logistic regression, etc)
• naive Bayes,
• k-means,
• summary statistics.
38
Sparse representation of features
39
dense : 1. 0. 0. 0. 0. 0. 3.
sparse :
8
><
>:
size : 7
indices : 0 6
values : 1. 3.
Exploiting sparsity in k-means
Training set:!
• number of examples: 12 million
• number of features: 500
• sparsity: 10%
!
Not only did we save 40GB of storage by switching to the sparse
format, but we also received a 4x speedup.
dense sparse
storage 47GB 7GB
time 240s 58s
40
Implementation of sparse k-means
Algorithm:!
• For each point, find its closest center.



• Update cluster centers.
li = arg min
j
kxi cjk2
2
cj =
P
i,li=j xj
P
i,li=j 1
41
Implementation of sparse k-means
The points are usually sparse, but the centers are most likely to be
dense. Computing the distance takes O(d) time. So the time
complexity is O(n d k) per iteration. We don’t take any advantage of
sparsity on the running time. However, we have
kx ck2
2 = kxk2
2 + kck2
2 2hx, ci
Computing the inner product only needs non-zero elements. So we
can cache the norms of the points and of the centers, and then only
need the inner products to obtain the distances. This reduce the
running time to O(nnz k + d k) per iteration.
42
Exploiting sparsity in linear methods
The essential part of the computation in a gradient-
based method is computing the sum of gradients. For
linear methods, the gradient is sparse if the feature
vector is sparse. In our implementation, the total
computation cost is linear on the input sparsity.
43
Acknowledgement
For sparse data support, special thanks go to!
• David Hall (breeze),
• Sam Halliday (netlib-java),
• and many others who joined the discussion.
44
Decision tree
Decision tree
• a major feature introduced in MLlib v1.0
• supports both classification and regression
• supports both continuous features and categorical
features
• scales well on massive datasets
46
Decision tree
Come to to learn more:!
• Scalable distributed decision trees in Spark MLlib
• Manish Made, Hirakendu Das, 

Evan Sparks, Ameet Talwalkar
47
More features to explore
• distributed matrices
• tall-and-skinny PCA and SVD
• L-BFGS
• binary classification model evaluation
48
Beyond v1.0
Next release (v1.1)
Spark has a 3-month release cycle:!
• July 25: cut-off for new features
• August 1: QA period starts
• August 15+: RC’s and voting
50
MLlib roadmap for v1.1
• Standardize interfaces!
• problem / formulation / algorithm / parameter set / model
• towards building an ecosystem
• Train multiple models in parallel!
• large-scale vs. medium-scale
• Statistical analysis!
• more descriptive statistics
• more sampling and bootstrapping
• hypothesis testing
51
MLlib roadmap for v1.1
• Learning algorithms!
• Non-negative matrix factorization (NMF)
• Sparse SVD
• Multiclass support in decision tree
• Latent Dirichlet allocation (LDA)
• …
• Optimization algorithms!
• Alternating direction method of multipliers (ADMM)
• Accelerated gradient methods
52
Beyond v1.1?
What to expect from the community (and maybe you):!
• scalable implementations of well known and well
understood machine learning algorithms
• user-friendly documentation and consistent APIs
• better integration with Spark SQL, Streaming, and GraphX
• addressing practical machine learning pipelines
53
Contributors
Ameet Talwalkar, Andrew Tulloch, Chen Chao, Nan Zhu, DB Tsai,
Evan Sparks, Frank Dai, Ginger Smith, Henry Saputra, Holden
Karau, Hossein Falaki, Jey Kottalam, Cheng Lian, Marek
Kolodziej, Mark Hamstra, Martin Jaggi, Martin Weindel, Matei
Zaharia, Nick Pentreath, Patrick Wendell, Prashant Sharma,
Reynold Xin, Reza Zadeh, Sandy Ryza, Sean Owen, Shivaram
Venkataraman, Tor Myklebust, Xiangrui Meng, Xinghao Pan,
Xusen Yin, Jerry Shao, Sandeep Singh, Ryan LeCompte
54
Interested?
• : http://spark-summit.org
• Website: http://spark.apache.org
• Tutorials: http://ampcamp.berkeley.edu
• Github: https://github.com/apache/spark
• Mailing lists: user@spark.apache.org

dev@spark.apache.org
55

Weitere ähnliche Inhalte

Was ist angesagt?

Mahout scala and spark bindings
Mahout scala and spark bindingsMahout scala and spark bindings
Mahout scala and spark bindings
Dmitriy Lyubimov
 
Co-occurrence Based Recommendations with Mahout, Scala and Spark
Co-occurrence Based Recommendations with Mahout, Scala and SparkCo-occurrence Based Recommendations with Mahout, Scala and Spark
Co-occurrence Based Recommendations with Mahout, Scala and Spark
sscdotopen
 
Wapid and wobust active online machine leawning with Vowpal Wabbit
Wapid and wobust active online machine leawning with Vowpal Wabbit Wapid and wobust active online machine leawning with Vowpal Wabbit
Wapid and wobust active online machine leawning with Vowpal Wabbit
Antti Haapala
 
Bringing Algebraic Semantics to Mahout
Bringing Algebraic Semantics to MahoutBringing Algebraic Semantics to Mahout
Bringing Algebraic Semantics to Mahout
sscdotopen
 

Was ist angesagt? (20)

Mahout scala and spark bindings
Mahout scala and spark bindingsMahout scala and spark bindings
Mahout scala and spark bindings
 
Optimization for Deep Networks (D2L1 2017 UPC Deep Learning for Computer Vision)
Optimization for Deep Networks (D2L1 2017 UPC Deep Learning for Computer Vision)Optimization for Deep Networks (D2L1 2017 UPC Deep Learning for Computer Vision)
Optimization for Deep Networks (D2L1 2017 UPC Deep Learning for Computer Vision)
 
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
 
Software Frameworks for Deep Learning (D1L7 2017 UPC Deep Learning for Comput...
Software Frameworks for Deep Learning (D1L7 2017 UPC Deep Learning for Comput...Software Frameworks for Deep Learning (D1L7 2017 UPC Deep Learning for Comput...
Software Frameworks for Deep Learning (D1L7 2017 UPC Deep Learning for Comput...
 
Co-occurrence Based Recommendations with Mahout, Scala and Spark
Co-occurrence Based Recommendations with Mahout, Scala and SparkCo-occurrence Based Recommendations with Mahout, Scala and Spark
Co-occurrence Based Recommendations with Mahout, Scala and Spark
 
Overview of Apache SystemML by Berthold Reinwald and Nakul Jindal
Overview of Apache SystemML by Berthold Reinwald and Nakul JindalOverview of Apache SystemML by Berthold Reinwald and Nakul Jindal
Overview of Apache SystemML by Berthold Reinwald and Nakul Jindal
 
Large-Scale Machine Learning with Apache Spark
Large-Scale Machine Learning with Apache SparkLarge-Scale Machine Learning with Apache Spark
Large-Scale Machine Learning with Apache Spark
 
Large Scale Machine Learning with Apache Spark
Large Scale Machine Learning with Apache SparkLarge Scale Machine Learning with Apache Spark
Large Scale Machine Learning with Apache Spark
 
Optimizing Deep Networks (D1L6 Insight@DCU Machine Learning Workshop 2017)
Optimizing Deep Networks (D1L6 Insight@DCU Machine Learning Workshop 2017)Optimizing Deep Networks (D1L6 Insight@DCU Machine Learning Workshop 2017)
Optimizing Deep Networks (D1L6 Insight@DCU Machine Learning Workshop 2017)
 
D1L5 Visualization (D1L2 Insight@DCU Machine Learning Workshop 2017)
D1L5 Visualization (D1L2 Insight@DCU Machine Learning Workshop 2017)D1L5 Visualization (D1L2 Insight@DCU Machine Learning Workshop 2017)
D1L5 Visualization (D1L2 Insight@DCU Machine Learning Workshop 2017)
 
Wapid and wobust active online machine leawning with Vowpal Wabbit
Wapid and wobust active online machine leawning with Vowpal Wabbit Wapid and wobust active online machine leawning with Vowpal Wabbit
Wapid and wobust active online machine leawning with Vowpal Wabbit
 
Large Scale Machine learning with Spark
Large Scale Machine learning with SparkLarge Scale Machine learning with Spark
Large Scale Machine learning with Spark
 
Time-Evolving Graph Processing On Commodity Clusters
Time-Evolving Graph Processing On Commodity ClustersTime-Evolving Graph Processing On Commodity Clusters
Time-Evolving Graph Processing On Commodity Clusters
 
H2O World - GLM - Tomas Nykodym
H2O World - GLM - Tomas NykodymH2O World - GLM - Tomas Nykodym
H2O World - GLM - Tomas Nykodym
 
Kx for wine tasting
Kx for wine tastingKx for wine tasting
Kx for wine tasting
 
Transfer Learning (D2L4 Insight@DCU Machine Learning Workshop 2017)
Transfer Learning (D2L4 Insight@DCU Machine Learning Workshop 2017)Transfer Learning (D2L4 Insight@DCU Machine Learning Workshop 2017)
Transfer Learning (D2L4 Insight@DCU Machine Learning Workshop 2017)
 
Bringing Algebraic Semantics to Mahout
Bringing Algebraic Semantics to MahoutBringing Algebraic Semantics to Mahout
Bringing Algebraic Semantics to Mahout
 
Machine Learning in q/kdb+ - Teaching KDB to Read Japanese
Machine Learning in q/kdb+ - Teaching KDB to Read JapaneseMachine Learning in q/kdb+ - Teaching KDB to Read Japanese
Machine Learning in q/kdb+ - Teaching KDB to Read Japanese
 
Object Segmentation (D2L7 Insight@DCU Machine Learning Workshop 2017)
Object Segmentation (D2L7 Insight@DCU Machine Learning Workshop 2017)Object Segmentation (D2L7 Insight@DCU Machine Learning Workshop 2017)
Object Segmentation (D2L7 Insight@DCU Machine Learning Workshop 2017)
 
Deep Learning for Computer Vision: Optimization (UPC 2016)
Deep Learning for Computer Vision: Optimization (UPC 2016)Deep Learning for Computer Vision: Optimization (UPC 2016)
Deep Learning for Computer Vision: Optimization (UPC 2016)
 

Andere mochten auch

Generalized Linear Models in Spark MLlib and SparkR by Xiangrui Meng
Generalized Linear Models in Spark MLlib and SparkR by Xiangrui MengGeneralized Linear Models in Spark MLlib and SparkR by Xiangrui Meng
Generalized Linear Models in Spark MLlib and SparkR by Xiangrui Meng
Spark Summit
 
Online Predictive Modeling of Fraud Schemes from Mulitple Live Streams by Cla...
Online Predictive Modeling of Fraud Schemes from Mulitple Live Streams by Cla...Online Predictive Modeling of Fraud Schemes from Mulitple Live Streams by Cla...
Online Predictive Modeling of Fraud Schemes from Mulitple Live Streams by Cla...
Spark Summit
 
Computational Advertising in Yelp Local Ads
Computational Advertising in Yelp Local AdsComputational Advertising in Yelp Local Ads
Computational Advertising in Yelp Local Ads
soupsranjan
 
Multi Model Machine Learning by Maximo Gurmendez and Beth Logan
Multi Model Machine Learning by Maximo Gurmendez and Beth LoganMulti Model Machine Learning by Maximo Gurmendez and Beth Logan
Multi Model Machine Learning by Maximo Gurmendez and Beth Logan
Spark Summit
 
Building a Recommendation Engine Using Diverse Features by Divyanshu Vats
Building a Recommendation Engine Using Diverse Features by Divyanshu VatsBuilding a Recommendation Engine Using Diverse Features by Divyanshu Vats
Building a Recommendation Engine Using Diverse Features by Divyanshu Vats
Spark Summit
 
PEDIDO DE PROVIDÊNCIA 814
PEDIDO DE PROVIDÊNCIA 814PEDIDO DE PROVIDÊNCIA 814
PEDIDO DE PROVIDÊNCIA 814
vereadoreduardo
 
The sps code of conduct 2011
The sps code of conduct 2011The sps code of conduct 2011
The sps code of conduct 2011
bambangsaja
 
Information wants to be free
Information wants to be freeInformation wants to be free
Information wants to be free
Eric Tachibana
 
Recommendation Letter - Xiuting
Recommendation Letter - XiutingRecommendation Letter - Xiuting
Recommendation Letter - Xiuting
Xiuting Hao
 
Excel dad6 8
Excel dad6 8Excel dad6 8
Excel dad6 8
daalt209
 

Andere mochten auch (20)

Generalized Linear Models in Spark MLlib and SparkR by Xiangrui Meng
Generalized Linear Models in Spark MLlib and SparkR by Xiangrui MengGeneralized Linear Models in Spark MLlib and SparkR by Xiangrui Meng
Generalized Linear Models in Spark MLlib and SparkR by Xiangrui Meng
 
Training
TrainingTraining
Training
 
Online Predictive Modeling of Fraud Schemes from Mulitple Live Streams by Cla...
Online Predictive Modeling of Fraud Schemes from Mulitple Live Streams by Cla...Online Predictive Modeling of Fraud Schemes from Mulitple Live Streams by Cla...
Online Predictive Modeling of Fraud Schemes from Mulitple Live Streams by Cla...
 
Sparse Data Support in MLlib
Sparse Data Support in MLlibSparse Data Support in MLlib
Sparse Data Support in MLlib
 
Computational Advertising in Yelp Local Ads
Computational Advertising in Yelp Local AdsComputational Advertising in Yelp Local Ads
Computational Advertising in Yelp Local Ads
 
Exploiting Ranking Factorization Machines for Microblog Retrieval
Exploiting Ranking Factorization Machines for Microblog RetrievalExploiting Ranking Factorization Machines for Microblog Retrieval
Exploiting Ranking Factorization Machines for Microblog Retrieval
 
(2016 07-19) providing click predictions in real-time at scale
(2016 07-19) providing click predictions in real-time at scale(2016 07-19) providing click predictions in real-time at scale
(2016 07-19) providing click predictions in real-time at scale
 
Multi Model Machine Learning by Maximo Gurmendez and Beth Logan
Multi Model Machine Learning by Maximo Gurmendez and Beth LoganMulti Model Machine Learning by Maximo Gurmendez and Beth Logan
Multi Model Machine Learning by Maximo Gurmendez and Beth Logan
 
Building a Recommendation Engine Using Diverse Features by Divyanshu Vats
Building a Recommendation Engine Using Diverse Features by Divyanshu VatsBuilding a Recommendation Engine Using Diverse Features by Divyanshu Vats
Building a Recommendation Engine Using Diverse Features by Divyanshu Vats
 
Presentación final
Presentación finalPresentación final
Presentación final
 
8 Truths About Exercising presented by Terry Febrey
8 Truths About Exercising presented by Terry Febrey8 Truths About Exercising presented by Terry Febrey
8 Truths About Exercising presented by Terry Febrey
 
医用画像情報イントロダクション Ver.1 0_20160726
医用画像情報イントロダクション Ver.1 0_20160726医用画像情報イントロダクション Ver.1 0_20160726
医用画像情報イントロダクション Ver.1 0_20160726
 
効果的なXPの導入を目的とした プラクティス間の相互作用の分析
効果的なXPの導入を目的とした プラクティス間の相互作用の分析効果的なXPの導入を目的とした プラクティス間の相互作用の分析
効果的なXPの導入を目的とした プラクティス間の相互作用の分析
 
Day 1 Reflection at #SXSW 2013 -- #SXSWOgilvy
Day 1 Reflection at #SXSW 2013 -- #SXSWOgilvy Day 1 Reflection at #SXSW 2013 -- #SXSWOgilvy
Day 1 Reflection at #SXSW 2013 -- #SXSWOgilvy
 
PEDIDO DE PROVIDÊNCIA 814
PEDIDO DE PROVIDÊNCIA 814PEDIDO DE PROVIDÊNCIA 814
PEDIDO DE PROVIDÊNCIA 814
 
8i standby
8i standby8i standby
8i standby
 
The sps code of conduct 2011
The sps code of conduct 2011The sps code of conduct 2011
The sps code of conduct 2011
 
Information wants to be free
Information wants to be freeInformation wants to be free
Information wants to be free
 
Recommendation Letter - Xiuting
Recommendation Letter - XiutingRecommendation Letter - Xiuting
Recommendation Letter - Xiuting
 
Excel dad6 8
Excel dad6 8Excel dad6 8
Excel dad6 8
 

Ähnlich wie Recent Developments in Spark MLlib and Beyond

Recent Developments in Spark MLlib and Beyond
Recent Developments in Spark MLlib and BeyondRecent Developments in Spark MLlib and Beyond
Recent Developments in Spark MLlib and Beyond
DataWorks Summit
 
MLconf NYC Xiangrui Meng
MLconf NYC Xiangrui MengMLconf NYC Xiangrui Meng
MLconf NYC Xiangrui Meng
MLconf
 
MLlib sparkmeetup_8_6_13_final_reduced
MLlib sparkmeetup_8_6_13_final_reducedMLlib sparkmeetup_8_6_13_final_reduced
MLlib sparkmeetup_8_6_13_final_reduced
Chao Chen
 

Ähnlich wie Recent Developments in Spark MLlib and Beyond (20)

Recent Developments in Spark MLlib and Beyond
Recent Developments in Spark MLlib and BeyondRecent Developments in Spark MLlib and Beyond
Recent Developments in Spark MLlib and Beyond
 
MLconf NYC Xiangrui Meng
MLconf NYC Xiangrui MengMLconf NYC Xiangrui Meng
MLconf NYC Xiangrui Meng
 
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
 
MLlib: Spark's Machine Learning Library
MLlib: Spark's Machine Learning LibraryMLlib: Spark's Machine Learning Library
MLlib: Spark's Machine Learning Library
 
Combining Machine Learning Frameworks with Apache Spark
Combining Machine Learning Frameworks with Apache SparkCombining Machine Learning Frameworks with Apache Spark
Combining Machine Learning Frameworks with Apache Spark
 
Apache Spark & MLlib
Apache Spark & MLlibApache Spark & MLlib
Apache Spark & MLlib
 
Combining Machine Learning frameworks with Apache Spark
Combining Machine Learning frameworks with Apache SparkCombining Machine Learning frameworks with Apache Spark
Combining Machine Learning frameworks with Apache Spark
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeek
 
Apache Spark Overview @ ferret
Apache Spark Overview @ ferretApache Spark Overview @ ferret
Apache Spark Overview @ ferret
 
Apache Spark: The Analytics Operating System
Apache Spark: The Analytics Operating SystemApache Spark: The Analytics Operating System
Apache Spark: The Analytics Operating System
 
Recent Developments In SparkR For Advanced Analytics
Recent Developments In SparkR For Advanced AnalyticsRecent Developments In SparkR For Advanced Analytics
Recent Developments In SparkR For Advanced Analytics
 
Spark ml streaming
Spark ml streamingSpark ml streaming
Spark ml streaming
 
Apache Spark - San Diego Big Data Meetup Jan 14th 2015
Apache Spark - San Diego Big Data Meetup Jan 14th 2015Apache Spark - San Diego Big Data Meetup Jan 14th 2015
Apache Spark - San Diego Big Data Meetup Jan 14th 2015
 
Spark meetup TCHUG
Spark meetup TCHUGSpark meetup TCHUG
Spark meetup TCHUG
 
Scaling Apache Spark MLlib to Billions of Parameters: Spark Summit East talk ...
Scaling Apache Spark MLlib to Billions of Parameters: Spark Summit East talk ...Scaling Apache Spark MLlib to Billions of Parameters: Spark Summit East talk ...
Scaling Apache Spark MLlib to Billions of Parameters: Spark Summit East talk ...
 
Scalable Data Science in Python and R on Apache Spark
Scalable Data Science in Python and R on Apache SparkScalable Data Science in Python and R on Apache Spark
Scalable Data Science in Python and R on Apache Spark
 
Yet another intro to Apache Spark
Yet another intro to Apache SparkYet another intro to Apache Spark
Yet another intro to Apache Spark
 
MLlib sparkmeetup_8_6_13_final_reduced
MLlib sparkmeetup_8_6_13_final_reducedMLlib sparkmeetup_8_6_13_final_reduced
MLlib sparkmeetup_8_6_13_final_reduced
 
Advanced Data Science on Spark-(Reza Zadeh, Stanford)
Advanced Data Science on Spark-(Reza Zadeh, Stanford)Advanced Data Science on Spark-(Reza Zadeh, Stanford)
Advanced Data Science on Spark-(Reza Zadeh, Stanford)
 
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's DataFrom Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
 

Kürzlich hochgeladen

TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
mohitmore19
 

Kürzlich hochgeladen (20)

OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
 
10 Trends Likely to Shape Enterprise Technology in 2024
10 Trends Likely to Shape Enterprise Technology in 202410 Trends Likely to Shape Enterprise Technology in 2024
10 Trends Likely to Shape Enterprise Technology in 2024
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
 
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
 
The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...
The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...
The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...
 
%in Midrand+277-882-255-28 abortion pills for sale in midrand
%in Midrand+277-882-255-28 abortion pills for sale in midrand%in Midrand+277-882-255-28 abortion pills for sale in midrand
%in Midrand+277-882-255-28 abortion pills for sale in midrand
 
Pharm-D Biostatistics and Research methodology
Pharm-D Biostatistics and Research methodologyPharm-D Biostatistics and Research methodology
Pharm-D Biostatistics and Research methodology
 
8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech students8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech students
 
Microsoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdfMicrosoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdf
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
 
%in kempton park+277-882-255-28 abortion pills for sale in kempton park
%in kempton park+277-882-255-28 abortion pills for sale in kempton park %in kempton park+277-882-255-28 abortion pills for sale in kempton park
%in kempton park+277-882-255-28 abortion pills for sale in kempton park
 
LEVEL 5 - SESSION 1 2023 (1).pptx - PDF 123456
LEVEL 5   - SESSION 1 2023 (1).pptx - PDF 123456LEVEL 5   - SESSION 1 2023 (1).pptx - PDF 123456
LEVEL 5 - SESSION 1 2023 (1).pptx - PDF 123456
 
Sector 18, Noida Call girls :8448380779 Model Escorts | 100% verified
Sector 18, Noida Call girls :8448380779 Model Escorts | 100% verifiedSector 18, Noida Call girls :8448380779 Model Escorts | 100% verified
Sector 18, Noida Call girls :8448380779 Model Escorts | 100% verified
 
Exploring the Best Video Editing App.pdf
Exploring the Best Video Editing App.pdfExploring the Best Video Editing App.pdf
Exploring the Best Video Editing App.pdf
 
ManageIQ - Sprint 236 Review - Slide Deck
ManageIQ - Sprint 236 Review - Slide DeckManageIQ - Sprint 236 Review - Slide Deck
ManageIQ - Sprint 236 Review - Slide Deck
 
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
Direct Style Effect Systems -The Print[A] Example- A Comprehension AidDirect Style Effect Systems -The Print[A] Example- A Comprehension Aid
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
 
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
 
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
 
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
 

Recent Developments in Spark MLlib and Beyond

  • 1. Recent Developments in Spark MLlib and Beyond Xiangrui Meng
  • 3. What is MLlib? MLlib is an Apache Spark component focusing on machine learning:! • initial contribution from AMPLab, UC Berkeley • shipped with Spark since version 0.8 (Sep 2013) • 40 contributors 3
  • 4. What is MLlib? Algorithms:! • classification: logistic regression, linear support vector machine (SVM), naive Bayes, classification tree • regression: generalized linear models (GLMs), regression tree • collaborative filtering: alternating least squares (ALS), non-negative matrix factorization (NMF) • clustering: k-means • decomposition: singular value decomposition (SVD), principal component analysis (PCA) 4
  • 6. scikit-learn Algorithms:! • classification: SVM, nearest neighbors, random forest, … • regression: support vector regression (SVR), ridge regression, Lasso, logistic regression, …! • clustering: k-means, spectral clustering, … • decomposition: PCA, non-negative matrix factorization (NMF), independent component analysis (ICA), … 6
  • 7. It is built on Apache Spark, a fast and general engine for large-scale data processing. • Run programs up to 100x faster than 
 Hadoop MapReduce in memory, or 
 10x faster on disk. • Write applications quickly in Java, Scala, or Python. Why MLlib? 7
  • 8. Spark philosophy Make life easy and productive for data scientists: • well documented, expressive APIs • powerful domain specific libraries • easy integration with storage systems • … and caching to avoid data movement 8
  • 9. Word count (Scala) val counts = sc.textFile("hdfs://...") .flatMap(line => line.split(" ")) .map(word => (word, 1L)) .reduceByKey(_ + _) 9
  • 10. Gradient descent val points = spark.textFile(...).map(parsePoint).cache() var w = Vector.zeros(d) for (i <- 1 to numIterations) {   val gradient = points.map { p =>     (1 / (1 + exp(-p.y * w.dot(p.x)) - 1) * p.y * p.x   ).reduce(_ + _)   w -= alpha * gradient } w w ↵ · nX i=1 g(w; xi, yi) 10
  • 11. K-means (scala) // Load and parse the data. val data = sc.textFile("kmeans_data.txt") val parsedData = data.map(_.split(‘ ').map(_.toDouble)).cache() ! // Cluster the data into two classes using KMeans. val clusters = KMeans.train(parsedData, 2, numIterations = 20) ! // Compute the sum of squared errors. val cost = clusters.computeCost(parsedData) println("Sum of squared errors = " + cost) 11
  • 12. K-means (python) # Load and parse the data data = sc.textFile("kmeans_data.txt") parsedData = data.map(lambda line: array([float(x) for x in line.split(' ‘)])).cache() ! # Build the model (cluster the data) clusters = KMeans.train(parsedData, 2, maxIterations = 10, runs = 1, initialization_mode = "kmeans||") ! # Evaluate clustering by computing the sum of squared errors def error(point): center = clusters.centers[clusters.predict(point)] return sqrt(sum([x**2 for x in (point - center)])) ! cost = parsedData.map(lambda point: error(point)) .reduce(lambda x, y: x + y) print("Sum of squared error = " + str(cost)) 12
  • 13. Dimension reduction
 + k-means // compute principal components val points: RDD[Vector] = ... val mat = RowMatrix(points) val pc = mat.computePrincipalComponents(20) ! // project points to a low-dimensional space val projected = mat.multiply(pc).rows ! // train a k-means model on the projected data val model = KMeans.train(projected, 10) 13
  • 14. Collaborative filtering // Load and parse the data val data = sc.textFile("mllib/data/als/test.data") val ratings = data.map(_.split(',') match { case Array(user, item, rate) => Rating(user.toInt, item.toInt, rate.toDouble) }) ! // Build the recommendation model using ALS val numIterations = 20 val model = ALS.train(ratings, 1, 20, 0.01) ! // Evaluate the model on rating data val usersProducts = ratings.map { case Rating(user, product, rate) => (user, product) } val predictions = model.predict(usersProducts) 14
  • 15. Why MLlib? It ships with Spark as 
 a standard component. 15
  • 17. Spark community One of the largest open source projects in big data:! • 190+ developers contributing (40 for MLlib) • 50+ companies contributing • 400+ discussions per month on the mailing list 17
  • 18. Spark community The most active project in the Hadoop ecosystem: JIRA 30 day summary Spark Map/Reduce YARN created 296 30 100 resolved 203 28 69 18
  • 20. Out for dinner?! • Search for a restaurant and make a reservation. • Start navigation. • Food looks good? Take a photo and share. 20
  • 21. Why smartphone? Out for dinner?! • Search for a restaurant and make a reservation. (Yellow Pages?) • Start navigation. (GPS?) • Food looks good? Take a photo and share. (Camera?) 21
  • 22. Why MLlib? A special-purpose device may be better at one aspect than a general-purpose device. But the cost of context switching is high: • different languages or APIs • different data formats • different tuning tricks 22
  • 23. Spark SQL + MLlib // Data can easily be extracted from existing sources, // such as Apache Hive. val trainingTable = sql(""" SELECT e.action, u.age, u.latitude, u.longitude FROM Users u JOIN Events e ON u.userId = e.userId""") ! // Since `sql` returns an RDD, the results of the above // query can be easily used in MLlib. val training = trainingTable.map { row => val features = Vectors.dense(row(1), row(2), row(3)) LabeledPoint(row(0), features) } ! val model = SVMWithSGD.train(training) 23
  • 24. Streaming + MLlib // collect tweets using streaming ! // train a k-means model val model: KMmeansModel = ... ! // apply model to filter tweets val tweets = TwitterUtils.createStream(ssc, Some(authorizations(0))) val statuses = tweets.map(_.getText) val filteredTweets = statuses.filter(t => model.predict(featurize(t)) == clusterNumber) ! // print tweets within this particular cluster filteredTweets.print() 24
  • 25. GraphX + MLlib // assemble link graph val graph = Graph(pages, links) val pageRank: RDD[(Long, Double)] = graph.staticPageRank(10).vertices ! // load page labels (spam or not) and content features val labelAndFeatures: RDD[(Long, (Double, Seq((Int, Double)))] = ... val training: RDD[LabeledPoint] = labelAndFeatures.join(pageRank).map { case (id, ((label, features), pageRank)) => LabeledPoint(label, Vectors.sparse(features ++ (1000, pageRank)) } ! // train a spam detector using logistic regression val model = LogisticRegressionWithSGD.train(training) 25
  • 26. Why MLlib? • built on Spark’s lightweight yet powerful APIs • Spark’s open source community • seamless integration with Spark’s other components 26 • comparable to or even better than other libraries specialized in large-scale machine learning
  • 27. Why MLlib? In MLlib’s implementations, we care about! • scalability • performance • user-friendly documentation and APIs • cost of maintenance 27
  • 28. v1.0
  • 29. What’s new in MLlib • new user guide and code examples • API stability • sparse data support • regression and classification tree • distributed matrices • tall-and-skinny PCA and SVD • L-BFGS • binary classification model evaluation 29
  • 30. MLlib user guide • v0.9 (link) • a single giant page • mixed code snippets in different languages • v1.0 (link) • The content is more organized. • Tabs are used to separate code snippets in different languages. • Unified API docs. 30
  • 31. Code examples Code examples that can be used as templates to build standalone applications are included under the “examples/” folder with sample datasets: • binary classification (linear SVM and logistic regression) • decision tree • k-means • linear regression • naive Bayes • tall-and-skinny PCA and SVD • collaborative filtering 31
  • 32. Code examples MovieLensALS: an example app for ALS on MovieLens data. Usage: MovieLensALS [options] <input> ! --rank <value> rank, default: 10 --numIterations <value> number of iterations, default: 20 --lambda <value> lambda (smoothing constant), default: 1.0 --kryo use Kryo serialization --implicitPrefs use implicit preference <input> input paths to a MovieLens dataset of ratings 32
  • 33. API stability • Following Spark core, MLlib is guaranteeing binary compatibility for all 1.x releases on stable APIs. • For changes in experimental and developer APIs, we will provide migration guide between releases. 33
  • 35. “large-scale sparse problems” Sparse datasets appear almost everywhere in the world of big data, where the sparsity may come from many sources, e.g., • feature transformation: 
 one-hot encoding, interaction, and bucketing, • large feature space: n-grams, • missing data: rating matrix, • low-rank structure: images and signals. 35
  • 36. Sparsity is almost everywhere The Netflix Prize:! • number of users: 480,189 • number of movies: 17,770 • number of observed ratings: 100,480,507 • sparsity = 1.17% 36
  • 37. Sparsity is almost everywhere rcv1.binary (test)! • number of examples: 677,399 • number of features: 47,236 • sparsity: 0.15% • storage: 270GB (dense) or 600MB (sparse) 37
  • 38. Exploiting sparsity • In Spark v1.0, MLlib adds support for sparse input data in Scala, Java, and Python. • MLlib takes advantage of sparsity in both storage and computation in • linear methods (linear SVM, logistic regression, etc) • naive Bayes, • k-means, • summary statistics. 38
  • 39. Sparse representation of features 39 dense : 1. 0. 0. 0. 0. 0. 3. sparse : 8 >< >: size : 7 indices : 0 6 values : 1. 3.
  • 40. Exploiting sparsity in k-means Training set:! • number of examples: 12 million • number of features: 500 • sparsity: 10% ! Not only did we save 40GB of storage by switching to the sparse format, but we also received a 4x speedup. dense sparse storage 47GB 7GB time 240s 58s 40
  • 41. Implementation of sparse k-means Algorithm:! • For each point, find its closest center.
 
 • Update cluster centers. li = arg min j kxi cjk2 2 cj = P i,li=j xj P i,li=j 1 41
  • 42. Implementation of sparse k-means The points are usually sparse, but the centers are most likely to be dense. Computing the distance takes O(d) time. So the time complexity is O(n d k) per iteration. We don’t take any advantage of sparsity on the running time. However, we have kx ck2 2 = kxk2 2 + kck2 2 2hx, ci Computing the inner product only needs non-zero elements. So we can cache the norms of the points and of the centers, and then only need the inner products to obtain the distances. This reduce the running time to O(nnz k + d k) per iteration. 42
  • 43. Exploiting sparsity in linear methods The essential part of the computation in a gradient- based method is computing the sum of gradients. For linear methods, the gradient is sparse if the feature vector is sparse. In our implementation, the total computation cost is linear on the input sparsity. 43
  • 44. Acknowledgement For sparse data support, special thanks go to! • David Hall (breeze), • Sam Halliday (netlib-java), • and many others who joined the discussion. 44
  • 46. Decision tree • a major feature introduced in MLlib v1.0 • supports both classification and regression • supports both continuous features and categorical features • scales well on massive datasets 46
  • 47. Decision tree Come to to learn more:! • Scalable distributed decision trees in Spark MLlib • Manish Made, Hirakendu Das, 
 Evan Sparks, Ameet Talwalkar 47
  • 48. More features to explore • distributed matrices • tall-and-skinny PCA and SVD • L-BFGS • binary classification model evaluation 48
  • 50. Next release (v1.1) Spark has a 3-month release cycle:! • July 25: cut-off for new features • August 1: QA period starts • August 15+: RC’s and voting 50
  • 51. MLlib roadmap for v1.1 • Standardize interfaces! • problem / formulation / algorithm / parameter set / model • towards building an ecosystem • Train multiple models in parallel! • large-scale vs. medium-scale • Statistical analysis! • more descriptive statistics • more sampling and bootstrapping • hypothesis testing 51
  • 52. MLlib roadmap for v1.1 • Learning algorithms! • Non-negative matrix factorization (NMF) • Sparse SVD • Multiclass support in decision tree • Latent Dirichlet allocation (LDA) • … • Optimization algorithms! • Alternating direction method of multipliers (ADMM) • Accelerated gradient methods 52
  • 53. Beyond v1.1? What to expect from the community (and maybe you):! • scalable implementations of well known and well understood machine learning algorithms • user-friendly documentation and consistent APIs • better integration with Spark SQL, Streaming, and GraphX • addressing practical machine learning pipelines 53
  • 54. Contributors Ameet Talwalkar, Andrew Tulloch, Chen Chao, Nan Zhu, DB Tsai, Evan Sparks, Frank Dai, Ginger Smith, Henry Saputra, Holden Karau, Hossein Falaki, Jey Kottalam, Cheng Lian, Marek Kolodziej, Mark Hamstra, Martin Jaggi, Martin Weindel, Matei Zaharia, Nick Pentreath, Patrick Wendell, Prashant Sharma, Reynold Xin, Reza Zadeh, Sandy Ryza, Sean Owen, Shivaram Venkataraman, Tor Myklebust, Xiangrui Meng, Xinghao Pan, Xusen Yin, Jerry Shao, Sandeep Singh, Ryan LeCompte 54
  • 55. Interested? • : http://spark-summit.org • Website: http://spark.apache.org • Tutorials: http://ampcamp.berkeley.edu • Github: https://github.com/apache/spark • Mailing lists: user@spark.apache.org
 dev@spark.apache.org 55