Spark is a general-purpose engine for large-scale data processing. It introduces Resilient Distributed Datasets (RDDs): read-only collections partitioned across a cluster that can be cached in memory for speed and rebuilt from their lineage if a partition is lost, all behind an API that feels like familiar Scala collections. RDDs provide a programming model of lazy transformations such as map and flatMap, plus actions such as reduce and collect that compute results. Spark also supports streaming, SQL, machine learning, and graph processing workloads.
6. RDDs
• Resilient distributed datasets
• "read-only collection of objects partitioned
across a set of machines that can be rebuilt if a
partition is lost"
• Familiar Scala collections API for distributed data
and computation
• Monadic expression of lazy transformations, but
not monads
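The "familiar Scala collections API" point can be seen without a cluster at all: the same word-count chain used on RDDs later in the talk works, almost verbatim, on a plain Scala List. This is not Spark code, just standard collections used as an analogy.

```scala
// Not Spark: the same chain on a plain Scala List, to show how closely the
// RDD API mirrors the standard collections API. Swap List for an RDD (and
// use reduceByKey instead of the groupBy stand-in) and the shape is unchanged.
val titles = List("Programming in Scala", "Scala in Depth", "Learning Spark")

val counts = titles
  .flatMap(_.split(" "))             // tokenize
  .map(word => (word.toLowerCase, 1))
  .groupBy(_._1)                     // local stand-in for reduceByKey
  .map { case (word, pairs) => (word, pairs.map(_._2).sum) }

// counts is a Map[String, Int]; e.g. counts("scala") == 2
```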
7. Spark Shell
• Interactive queries and prototyping
• Local, YARN, Mesos
• Static type checking and auto complete
• Lambdas
9. val titles = sc.textFile("titles.txt")
val countsRdd = titles
  .flatMap(tokenize)               // tokenize: String => Seq[String], defined elsewhere in the talk
  .map(word => (cleanse(word), 1)) // cleanse: String => String, defined elsewhere in the talk
  .reduceByKey(_ + _)
val counts = countsRdd
  .filter { case (_, total) => total > 10000 }
  .sortBy { case (_, total) => total }
  .filter { case (word, _) => word.length >= 5 }
  .collect
13. val sqlContext =
new org.apache.spark.sql.SQLContext(sc)
import sqlContext._
case class Count(word: String, total: Int)
val schemaRdd =
countsRdd.map(c => Count(c._1, c._2))
val count = schemaRdd
.where('word === "scala")
.select('total)
.collect
15. registerFunction("LEN", (_: String).length)
val queryRdd = sql("""
  SELECT * FROM counts
  WHERE LEN(word) = 10
  ORDER BY total DESC
  LIMIT 10
""")
queryRdd
  .map(c => s"word: ${c(0)}\t| total: ${c(1)}")
  .collect()
  .foreach(println)
17. Spark Streaming
• Real-time computation, similar to Storm
• Received input replicated in memory for fault tolerance
• Streams input into sliding windows of RDDs
• Kafka, Flume, Kinesis, HDFS
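The sliding-window idea above can be sketched locally with plain Scala, no Spark required: Spark Streaming chops a live stream into micro-batch RDDs and computes over sliding windows of them, and Scala's Iterator#sliding shows the same shape over an in-memory sequence. The batch counts here are made up for illustration.

```scala
// Not Spark Streaming itself: a plain-Scala analogy for windowed computation
// over a stream of micro-batches (here, a hypothetical count per batch).
val perBatchCounts = Iterator(3, 1, 4, 1, 5, 9)

// Window of 3 batches, sliding forward by 1 batch, summing each window,
// just as a windowed DStream would aggregate its underlying RDDs.
val windowedTotals = perBatchCounts
  .sliding(3, 1)
  .map(_.sum)
  .toList
// windowedTotals == List(8, 6, 10, 15)
```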
21. GraphX
• Optimally partitions and indexes vertices and
edges represented as RDDs
• APIs to join and traverse graphs
• PageRank, connected components, triangle
counting
22. val graph = Graph(userIdRDD, assocRDD)
val ranks = graph.pageRank(0.0001).vertices
val userRDD = sc.textFile("graphx/data/users.txt")
val users = userRDD.map { line =>
val fields = line.split(",")
(fields(0).toLong, fields(1))
}
val ranksByUsername = users.join(ranks).map {
case (id, (username, rank)) => (username, rank)
}
24. MLlib
• Machine learning library similar to Mahout
• Statistics, regression, decision trees, clustering,
PCA, gradient descent
• Iterative algorithms much faster due to in-memory caching
25. import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}
import org.apache.spark.mllib.linalg.Vectors
val data = sc.textFile("data.txt")
val parsedData = data.map { line =>
val parts = line.split(',')
LabeledPoint(
parts(0).toDouble,
Vectors.dense(parts(1).split(' ').map(_.toDouble))
)
}
val model = LinearRegressionWithSGD.train(
parsedData, 100
)
val valuesAndPreds = parsedData.map { point =>
val prediction = model.predict(point.features)
(point.label, prediction)
}
val MSE = valuesAndPreds
  .map { case (v, p) => math.pow(v - p, 2) }.mean()
26. RDDs
• Resilient distributed datasets
• Familiar Scala collections API
• Distributed data and computation
• Monadic expression of transformations
• But not monads
27. Pseudo Monad
• Wraps an iterator plus partition distribution
• Keeps track of lineage history for fault tolerance
• Lazily evaluated; expressions chain without computing
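The lazy chaining described above has a direct analogue in plain Scala views: like RDD transformations, operations on a view are only recorded, and nothing runs until a terminal step (the analogue of a Spark action) forces evaluation. A minimal local sketch:

```scala
// Plain-Scala sketch of lazy chaining: the side-effect counter shows that
// the map is recorded but not executed until the chain is forced.
var evaluations = 0
val pipeline = (1 to 5).view
  .map { n => evaluations += 1; n * 2 }  // transformation: recorded, not run
  .filter(_ > 4)

assert(evaluations == 0)       // nothing has been computed yet
val result = pipeline.toList   // the "action": forces the whole chain
// result == List(6, 8, 10); evaluations is now 5
```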
29. RDD Interface
• compute: transformation applied to iterable(s)
• getPartitions: partition data for parallel
computation
• getDependencies: lineage of parent RDDs and if
shuffle is required
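The three members above can be sketched as a heavily simplified Scala trait. This is an illustration of the contract, not Spark's real abstract class (org.apache.spark.rdd.RDD carries much more machinery); the toy SeqRDD and MapRDD names are invented here, though MapRDD mirrors the MappedRDD described on the next slides.

```scala
// Simplified sketch of the RDD contract: each concrete RDD says how to
// compute a partition, how the data is partitioned, and what it depends on.
trait SimpleRDD[T] {
  def compute(partition: Int): Iterator[T]   // transformation applied per partition
  def getPartitions: Seq[Int]                // partition ids for parallel computation
  def getDependencies: Seq[SimpleRDD[_]]     // lineage: parent RDDs, if any
}

// Toy source RDD over an in-memory sequence, split into two partitions
// (a stand-in for HadoopRDD's "read an HDFS block", with no dependencies).
class SeqRDD(data: Seq[Int]) extends SimpleRDD[Int] {
  def getPartitions = Seq(0, 1)
  def getDependencies = Seq.empty
  def compute(partition: Int) = {
    val half = (data.length + 1) / 2
    if (partition == 0) data.take(half).iterator else data.drop(half).iterator
  }
}

// Toy mapped RDD: computes its parent's partition, then maps the result;
// same partitioning, single dependency on the parent.
class MapRDD(parent: SimpleRDD[Int], f: Int => Int) extends SimpleRDD[Int] {
  def getPartitions = parent.getPartitions
  def getDependencies = Seq(parent)
  def compute(partition: Int) = parent.compute(partition).map(f)
}
```

Chaining `new MapRDD(new SeqRDD(Seq(1, 2, 3, 4)), _ * 2)` and computing every partition yields the doubled data, with the lineage (one parent) recoverable from getDependencies.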
30. HadoopRDD
• compute: read HDFS block or file split
• getPartitions: HDFS block or file split
• getDependencies: None
31. MappedRDD
• compute: compute parent and map result
• getPartitions: parent partition
• getDependencies: single dependency on parent
32. CoGroupedRDD
• compute: compute parent RDDs, shuffle, then group
the results
• getPartitions: one per reduce task
• getDependencies: shuffle each parent RDD
33. Summary
• Simple Unified API through RDDs
• Interactive Analysis
• Hadoop Integration
• Performance