Spark offers a number of advantages over its predecessor, MapReduce, that make it ideal for large-scale machine learning. For example, Spark includes MLlib, a library of machine learning algorithms for large-scale data. The presentation will cover the state of MLlib and the details of some of the scalable algorithms it includes.
2. Me
● Data scientist at Cloudera
● Recently led Apache Spark development at Cloudera
● Before that, a committer on Apache Hadoop
● Before that, studied combinatorial optimization and distributed systems at Brown
11. Two Main Problems
● Designing a system for processing huge data in parallel
● Taking advantage of it with algorithms that work well in parallel
12. System Requirements
● Scalability
● Programming model that abstracts away distributed ugliness
● Data-scientist friendly
  ○ High-level operators
  ○ Interactive shell (REPL), as in the snippet below
● Efficiency for iterative algorithms
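For instance, a minimal spark-shell session might look like this (the dataset here is illustrative):

scala> val numbers = sc.parallelize(1 to 1000000)
scala> numbers.filter(_ % 2 == 0).count()
res0: Long = 500000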
13. MapReduce
(Diagram: a wave of parallel map tasks feeding into a smaller set of reduce tasks)
Key advances by MapReduce:
• Data locality: automatic split computation and launching of mappers where the data lives
• Fault tolerance: writing out intermediate results, plus restartable mappers, meant the ability to run on commodity hardware
• Linear scalability: the combination of locality and a programming model that forces developers to write generally scalable solutions to problems
14. Spark: Easy and Fast Big Data
• Easy to Develop
  • Rich APIs in Java, Scala, Python
  • Interactive shell
• Fast to Run
  • General execution graphs
  • In-memory storage
2-5× less code; up to 10× faster than MapReduce on disk, 100× in memory
15. What is Spark?
Spark is a general-purpose computation framework geared towards massive data, and more flexible than MapReduce.
Extra properties:
• Leverages distributed memory
• Full directed-graph expressions for data-parallel computations
• Improved developer experience
Yet retains: linear scalability, fault tolerance, and data locality.
16. Spark introduces the concept of the RDD to take advantage of memory
RDD = Resilient Distributed Dataset
• Defined by parallel transformations on data in stable storage
20. RDDs
(Diagram: bigfile.txt → lines → numbers → sum)
val lines = sc.textFile("bigfile.txt")
val numbers = lines.map(x => x.toDouble)
numbers.sum()
21. RDDs
(Diagram: the same computation, showing each RDD split into partitions, data read from HDFS, and the sum returned to the Driver)
val lines = sc.textFile("bigfile.txt")
val numbers = lines.map(x => x.toDouble)
numbers.sum()
22. Shuffle
(Diagram: sorting triggers a shuffle, redistributing data across partitions, before the sum is returned to the Driver)
val lines = sc.textFile("bigfile.txt")
val numbers = lines.map(x => x.toDouble)
val sorted = numbers.sortBy(x => x)
sorted.sum()
23. Persistence and Fault Tolerance
• User decides whether and how to persist
  • Disk
  • Memory
  • Transient (recomputed on each use)
Observation: this design provides fault tolerance through the concept of lineage; a lost partition can be recomputed from the transformations that defined it, as the sketch below illustrates.
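A minimal sketch of these choices in user code, using the standard RDD persistence API:

import org.apache.spark.storage.StorageLevel

val numbers = sc.textFile("bigfile.txt").map(_.toDouble)
numbers.persist(StorageLevel.MEMORY_ONLY)   // keep deserialized partitions in memory
// numbers.persist(StorageLevel.DISK_ONLY)  // alternative: spill to local disk
numbers.sum()  // the first action materializes (and caches) the RDD
// If a node is lost, only the missing partitions are recomputed,
// by replaying the lineage (textFile -> map) that defined them.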
28. Out of the Box Functionality
• Hadoop Integration
  • Works with Hadoop data
  • Runs under YARN
• Libraries
  • MLlib
  • Spark Streaming
  • GraphX (alpha)
• Roadmap
  • Language support:
    • Improved Python support
    • SparkR
    • Java 8
  • Schema support in Spark's APIs
  • Better ML
    • Sparse data support
    • Model evaluation framework
    • Performance testing
29. So back to ML
(The same slide as above, with MLlib highlighted.)
30. Spark MLlib
● Supervised / Discrete: Classification
  ○ Logistic regression (and regularized variants)
  ○ Linear SVM
  ○ Naive Bayes
  ○ Random decision forests (soon)
● Supervised / Continuous: Regression
  ○ Linear regression (and regularized variants)
● Unsupervised / Discrete: Clustering
  ○ K-means
● Unsupervised / Continuous: Dimensionality reduction, matrix factorization
  ○ Principal component analysis / singular value decomposition
  ○ Alternating least squares
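To make the table concrete, the classification entries are invoked in much the same style as the clustering example later in the talk. A sketch using logistic regression (the file name and format are illustrative, and newer MLlib versions take Vector features rather than plain arrays):

import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.regression.LabeledPoint

// Each line: a label followed by space-separated features, e.g. "1 2.0 3.5"
val points = sc.textFile("lr_data.txt").map { line =>
  val parts = line.split(' ').map(_.toDouble)
  LabeledPoint(parts(0), parts.tail)
}
val model = LogisticRegressionWithSGD.train(points, 100)  // 100 iterations of SGD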
34. Why Cluster Big Data?
● Learn the structure of your data
● Interpret new data as it relates to this structure
41. Using it
import org.apache.spark.mllib.clustering.KMeans

val data = sc.textFile("kmeans_data.txt")
val parsedData = data.map(_.split(' ').map(_.toDouble))
// Cluster the data into two classes using KMeans
val numIterations = 20
val numClusters = 2
val clusters = KMeans.train(parsedData, numClusters, numIterations)
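A natural follow-up is to evaluate the returned model with KMeansModel.computeCost from the same MLlib API (WSSSE = within-set sum of squared errors):

// Sum of squared distances of points to their nearest center
val WSSSE = clusters.computeCost(parsedData)
println("Within Set Sum of Squared Errors = " + WSSSE)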
42. K-Means
● Alternate between two steps:
  ○ Assign each point to a cluster based on existing centers
  ○ Recompute cluster centers from the points in each cluster
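A minimal single-machine sketch of this alternation in plain Scala (one-dimensional points, fixed iteration count; not the MLlib implementation):

def kmeans(points: Seq[Double], k: Int, iterations: Int): Seq[Double] = {
  var centers = points.take(k)  // naive initialization: the first k points
  for (_ <- 1 to iterations) {
    // Step 1: assign each point to its closest existing center.
    val clusters = points.groupBy(p => centers.minBy(c => (c - p) * (c - p)))
    // Step 2: recompute each center as the mean of its assigned points.
    centers = centers.map(c => clusters.get(c).map(ps => ps.sum / ps.size).getOrElse(c))
  }
  centers
}
// e.g. kmeans(Seq(1.0, 1.1, 5.0, 5.2), 2, 10) converges to Seq(1.05, 5.1)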
48. K-Means - very parallelizable
● Alternate between two steps:
  ○ Assign each point to a cluster based on existing centers
    (process each data point independently)
  ○ Recompute cluster centers from the points in each cluster
    (average across partitions)
49.
// Find the sum and count of points mapping to each center
val totalContribs = data.mapPartitions { points =>
val k = centers.length
val dims = centers(0).vector.length
val sums = Array.fill(k)(BDV.zeros[Double](dims).asInstanceOf[BV[Double]])
val counts = Array.fill(k)(0L)
points.foreach { point =>
val (bestCenter, cost) = KMeans.findClosest(centers, point)
costAccum += cost
sums(bestCenter) += point.vector
counts(bestCenter) += 1
}
val contribs = for (j <- 0 until k) yield {
(j, (sums(j), counts(j)))
}
contribs.iterator
}.reduceByKey(mergeContribs).collectAsMap()
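The slide's code refers to mergeContribs, which is not shown. A plausible implementation, an assumption consistent with the types above (BV/BDV being the MLlib source's aliases for Breeze vectors, and costAccum an accumulator), would be:

// Assumed aliases, as in the MLlib source:
// import breeze.linalg.{Vector => BV, DenseVector => BDV}

// Merge two per-partition contributions for the same center by adding
// their point sums and point counts. The in-place += is safe because
// each (sum, count) pair is private to the reduction.
def mergeContribs(p1: (BV[Double], Long), p2: (BV[Double], Long)): (BV[Double], Long) =
  (p1._1 += p2._1, p1._2 + p2._2)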
50.
// Update the cluster centers and costs
var changed = false
var j = 0
while (j < k) {
val (sum, count) = totalContribs(j)
if (count != 0) {
sum /= count.toDouble
val newCenter = new BreezeVectorWithNorm(sum)
if (KMeans.fastSquaredDistance(newCenter, centers(j)) > epsilon * epsilon) {
changed = true
}
centers(j) = newCenter
}
j += 1
}
if (!changed) {
logInfo("Run " + run + " finished in " + (iteration + 1) + " iterations")
}
cost = costAccum.value
52. The Problem
● K-Means is very sensitive to the initial set of center points chosen.
● The best existing algorithm for choosing centers is highly sequential.
54. K-Means++
● Start with a random point from the dataset
● Pick another one randomly, with probability proportional to its squared distance from the closest already-chosen center
● Repeat until all initial centers are chosen
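A sequential sketch of this seeding procedure (again one-dimensional, with the standard squared-distance weighting; identifiers are illustrative):

import scala.collection.mutable.ArrayBuffer
import scala.util.Random

def kmeansPlusPlusInit(points: IndexedSeq[Double], k: Int, rng: Random): Seq[Double] = {
  val centers = ArrayBuffer(points(rng.nextInt(points.size)))
  while (centers.size < k) {
    // Weight each point by its squared distance to the nearest chosen center.
    val weights = points.map(p => centers.map(c => (c - p) * (c - p)).min)
    // Sample the next center with probability proportional to its weight.
    var r = rng.nextDouble() * weights.sum
    val next = points.zip(weights)
      .find { case (_, w) => r -= w; r <= 0 }
      .map(_._1)
      .getOrElse(points.last)  // guard against floating-point underrun
    centers += next
  }
  centers.toSeq
}

Each pick depends on every center chosen so far, which is exactly the sequential bottleneck slide 52 points out.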